ABSTRACT


We want to estimate the vector of multinomial cell probabilities p 
from incomplete data, incomplete in that it contains partially classified 
observations. Each such partially classified observation is observed to 
fall in one of two or more selected categories but is not classified fur- 
ther into a single category. The data is assumed to be incomplete at 
random. The estimation criterion is minimization of risk for quadratic 
loss. The estimators are the classical maximum likelihood estimate, the 
Bayesian posterior mode, and the posterior mean. An approximation we 
develop is used for the posterior mean. The Dirichlet, the conjugate 
prior for the multinomial distribution, is assumed for the prior distri- 
bution. 

We show these three estimators to be approximately equal in large 
samples. We then study risk in small- and medium-size samples through 
Monte-Carlo simulation studies for the trinomial distribution. Samples 
are of size 25 and 50, percentage of incomplete data varies around 15 
and 40, and probabilities range from the center of the probability
simplex $P_2$ to one of its corners. Probabilities equal the means of the
prior distributions for varying prior parameters or are randomly gen- 
erated from these distributions. Priors used in the Bayesian estimators 
are the correct prior, a uniform prior, and a perturbed prior. The EM 
iterative algorithm of Dempster, Laird, and Rubin (1977) is used to eval- 
uate all three estimators. 

Results indicated that the relationship between the probability p
being estimated and the prior parameters $\nu$ used in the Bayesian estimators
was one of the most important factors in determining which estimator
was preferable. If the mean of the Dirichlet distribution given
the prior parameters $\nu$ was within a fairly wide range of p, then the
posterior mean was the best estimator of p. If the mean was far from p,
then the maximum likelihood estimate was best. Between these extremes
was a region in which the posterior mode was often best when p was toward
a corner of $P_2$. The maximum likelihood estimate and posterior mode were
equally best at a corner. When the best estimator was used, risk was 
usually reduced by one-fourth to one-third over that of the next best 
estimator and by one-third to one-half over that of the worst estimator. 
However, the reduction in risk was sometimes substantial. The largest 
reduction occurred at the corner p=(0,0,1); the risk of the posterior
mean was as much as 33,000 times larger than the risk of the posterior
mode or maximum likelihood estimate.

As the percentage of incomplete data increased, the risk of the 
three estimators did not greatly increase and the relationship among 
the estimators changed little. As sample size increased, risk and the 
difference in risk between estimators usually decreased. 

Because numerical evaluation of the exact posterior central moments 
is generally unfeasible, we also develop approximations for elements of 
the posterior mean and covariance matrices. The best of three approximations
considered for the posterior mean is based on a first-order
Taylor-series expansion of the exact posterior mean that has accuracy of
order $O(n^{-1})$. Because terms in the expansion are then approximated, the
final approximation, called the Taylor-series approximate posterior mean,
is not necessarily accurate to order $O(n^{-1})$. However, we show that this
approximation asymptotically equals the exact posterior mean. Further,
we give two conditions which guarantee that the error between the exact
posterior mean and an iterative solution of the Taylor-series approximate
posterior mean is of magnitude $O(n^{-1})$.

Approximations used for elements of the posterior covariance matrix
are based on Taylor-series expansions accurate to order $O(n^{-3/2})$. When
the iterative solution for the Taylor-series approximate posterior mean
has accuracy of magnitude $O(n^{-1})$, then the Taylor-series approximate
posterior variance and covariance can be evaluated noniteratively to have
accuracy of magnitude $O(n^{-3/2})$. These approximations can also be
evaluated iteratively. However, assurance of accuracy of magnitude $O(n^{-3/2})$
then depends on satisfaction of the two conditions discussed for iterative
solution of the Taylor-series approximate posterior mean.

An important property of the Taylor-series approximations is that,
as the percentage of incomplete data goes to zero, they go to the exact 
posterior moments. In addition, the relationship between the Taylor- 
series approximate posterior mean and the posterior mode parallels their 
complete-data relationship. 

In the same Monte-Carlo simulation study used for the risk study, 
the Taylor-series approximation for the posterior mean was usually accu- 
rate to at least four significant figures; that for the posterior vari- 
ance, to at least three significant figures; and that for the posterior 
covariance, to at least two significant figures. In practice, the Tay- 
lor-series approximations will generally be more accurate than numerical 
evaluation of the corresponding exact posterior moments. 
CHAPTER 1
INTRODUCTION


1.1 Overview:

This thesis is concerned with simultaneous estimation of the vector 
p of cell probabilities from incomplete multinomial data where the criterion 
of goodness is minimization of risk for quadratic loss. As is well known, 
the posterior mean will minimize expected risk. However, complete-data 
results indicate that for at least boundary probabilities, the maximum 
likelihood estimate might be a better estimator. Hence, we study both 
estimators for specified values of p. In addition, we investigate a third 
estimator, the posterior mode, which has some advantages of each of the 
other two estimators. 

Because numerical evaluation is generally unfeasible, we also develop 
approximations for the posterior mean and covariance matrices. Therefore, 
part of this thesis concerns derivation of the approximations and proof of 
their accuracy. 

In the next section, we define the risk problem and detail reasons for 
choosing the posterior mean, maximum likelihood estimator, and posterior 
mode. We begin by defining special notation for the incomplete-data
problem. We also outline a robustness study concerning use of the correct
prior in the Bayesian estimators. In the third section, we review the 
literature of estimation from incomplete multinomial data. 

Chapter 2 describes the estimators. First we derive the exact posterior 
mean and central moments and illustrate the problems in their numerical 
computation. Then we give derivations for the mode estimators, the maximum 
likelihood estimate and posterior mode. In Chapter 3, we develop truncated 
Taylor-series approximations for the exact posterior mean and covariance
matrices. In Chapter 4 we prove the asymptotic, large-sample, accuracy 
of these approximations. For these large samples, the posterior mean, 
maximum likelihood estimate, and posterior mode are all approximately 
equal; hence, there will be little difference in their risks. 

We then turn to small-sample behavior of the estimators. For small-
and medium-size samples, we investigate (1) the accuracy of the Taylor- 
series approximations for the posterior mean and covariance matrices, (2) 
which of the Taylor-series approximation, maximum likelihood estimate, and 
posterior mode best approximates the posterior mean, (3) which estimator 
best minimizes risk for quadratic loss at specified values of p, and (4) 
how robust results in (3) are to use of the correct prior in the Bayesian 
estimators. Because we could not answer these questions analytically, we 
performed Monte-Carlo simulation studies for the trinomial distribution. 

In Chapter 5 we discuss the design and relevant computational procedures 
for two such studies. Chapters 6 and 7 give results of these two studies 
and guidelines for practical implementation of the results. 

In Chapter 8 we summarize the main research of the thesis, draw con- 
clusions, and recommend areas for future study. 
1.2 Problem Statement:


Assume that we have a k-dimensional Dirichlet prior

$$ g(p \mid \nu) = \left[ \Gamma\!\left( \sum_{i=1}^{k+1} \nu_i \right) \Big/ \prod_{i=1}^{k+1} \Gamma(\nu_i) \right] \prod_{i=1}^{k+1} p_i^{\nu_i - 1}, \tag{1.1} $$

where $\nu_i > 0$ and p takes values in the k-dimensional probability simplex
$P_k = \{(p_1,\ldots,p_{k+1}) : p_i \ge 0,\ \sum_{i=1}^{k+1} p_i = 1\}$. The Dirichlet density is the conjugate
prior for the multinomial distribution. Assume also that we have complete
data $x = (x_1,\ldots,x_{k+1})$, $n = \sum_{i=1}^{k+1} x_i$, denoting nonnegative integer sample values
of the random vector $X = (X_1,\ldots,X_{k+1})$ having the k-dimensional multinomial
distribution M(n;p) with density

$$ h(x \mid p) = \left[ n! \Big/ \prod_{i=1}^{k+1} x_i! \right] \prod_{i=1}^{k+1} p_i^{x_i}. \tag{1.2} $$

Thus, the k+1 components of x respectively denote the number of the n
observations that fall in k+1 mutually exclusive categories $C_1,\ldots,C_{k+1}$.
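
As a rough illustration of the sampling model in (1.1) and (1.2), the following sketch draws p from a Dirichlet prior and then complete-data counts x from M(n;p). The function names and the prior parameters (2, 3, 5) are hypothetical choices made only for this example; the Dirichlet draw uses the standard construction from normalized gamma variates.

```python
import random


def sample_dirichlet(nu, rng):
    """Draw p from a Dirichlet(nu) prior by normalizing gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in nu]
    total = sum(g)
    return [v / total for v in g]


def sample_multinomial(n, p, rng):
    """Draw complete-data counts x ~ M(n; p) by classifying n observations."""
    counts = [0] * len(p)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for i, p_i in enumerate(p):
            acc += p_i
            if u < acc:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point round-off
    return counts


rng = random.Random(0)
nu = [2.0, 3.0, 5.0]          # hypothetical prior parameters (trinomial case)
p = sample_dirichlet(nu, rng)
x = sample_multinomial(25, p, rng)
```

Here `x` plays the role of the complete data; partially classified observations would be obtained by merging some of these counts across categories.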

Suppose, however, that n observations are made on k+1 mutually exclusive
categories but that some of these observations are only partially observed
in that each of these observations falls in one of two or more of the k+1
categories but cannot further be classified into a single category. That is,
for some of the n observations one knows only that the observation falls in
one of $l$ particular categories for $1 < l \le k+1$ but not which one of these $l$
categories. This set of categories among which an observation is shared is
called a pattern of incomplete data.

We denote such a set of categories as C suffixed by the indices of the
sharing categories. For example, if an observation is known to fall in one
of categories $C_i$, $C_j$, or $C_l$ for $1 \le i,j,l \le k+1$, but cannot be specified
further, we write that the observation falls in $C_{ijl}$. More commonly, we
write the total of all such observations falling in $C_{ijl}$ as $z_{ijl}$ or $z_{\{i,j,l\}}$,
where, following a few more comments, we elaborate on these two z subscript
notations. Corresponding to the use of $x = (x_1,x_2,\ldots,x_{k+1})$, we write
$z = (z_1,z_2,\ldots,z_{12},z_{13},\ldots,z_{12\cdots k})$ [or $(z_{\{1\}},z_{\{2\}},\ldots,z_{\{1,2\}},z_{\{1,3\}},\ldots,z_{\{1,2,\ldots,k\}})$]
to denote the vector of incomplete data. Thus, $z = (z_1,z_2,z_3,z_{12},z_{13},z_{23})$
represents the vector of incomplete trinomial data having, for
example, $z_2$ completely specified observations falling in $C_2$ and $z_{13}$ incompletely
specified observations such that each observation is known to fall in one of
$C_1$ or $C_3$ ($C_{13}$) but is not specified further.

However, we need some way to abbreviate notation for summing and multiplying
over all collections containing a particular integer in forthcoming
equations. The least cumbersome approach is to adopt set notation and then,
for convenience and to parallel complete-data notation (i.e., complete-data
notation is $x_i$, not $x_{\{i\}}$), drop braces and commas where possible. Therefore,
in the next few paragraphs, we formally define the set notation used.

We first note that we want the notation to allow for dividing the data
into separate multinomial groups in the Hocking and Oxspring manner to be
described in the next section. Although we observe data in the general,
unrestricted, form $z_1, z_2, \ldots, z_{12}, \ldots, z_{12\cdots k}$, where the completely specified
data $z_1, z_2, \ldots, z_{k+1}$ need not be subdivided, we use the Hocking and Oxspring
restrictive form in writing the likelihood for the exact posterior central
moments in Chapter 2 and for some of the asymptotic proofs in Chapter 4. Thus,
for each incomplete-data pattern, we create notation to allow for enough
artificial completely specified observations to complete a multinomial group.
For example, if we observe $z_1$, $z_2$, $z_3$, $z_{12}$, and $z_{13}$, we can treat the
data in the Hocking and Oxspring manner as coming from three independent
distributions, one trinomial and two binomials, as follows: $v_1 = z_1$, $v_2$,
$v_3$; $y_{12} = z_{12}$, $y_3$; and $w_{13} = z_{13}$, $w_2$; where $v_2 + w_2 = z_2$ and $v_3 + y_3 = z_3$. Here,
$v_1$, $v_2$, and $v_3$ have a trinomial distribution with probabilities $p_1$, $p_2$,
and $p_3$; $y_{12}$ and $y_3$ have a binomial distribution with probabilities
$(p_1+p_2)$ and $p_3$; and $w_{13}$ and $w_2$ have a binomial distribution with
probabilities $(p_1+p_3)$ and $p_2$.


Therefore, for k the dimension of a multinomial distribution, let $S$
be a nonempty subset of {1,2,...,k+1} and let P be a set of mutually
exclusive and exhaustive subsets $S$. For example, for the trinomial
distribution we could have the following P and $S$:

$P_1$ = {{1},{2},{3}} containing $S_{1,1}$={1}, $S_{2,1}$={2}, and $S_{3,1}$={3};
$P_2$ = {{1,2},{3}} containing $S_{1,2}$={1,2} and $S_{2,2}$={3};
$P_3$ = {{1,3},{2}} containing $S_{1,3}$={1,3} and $S_{2,3}$={2}; and
$P_4$ = {{1},{2,3}} containing $S_{1,4}$={1} and $S_{2,4}$={2,3}.

Define $S,P$ to be the set element $S$ in the set P. Suppose that there
are $s_{S,P}$ elements in $S,P$. Let $z_{S,P}$ be the number of observations such
that each observation falls in one of the $s_{S,P}$ categories $C_i$ for $i \in S$, but
is not further classified into a particular one of these $s_{S,P}$ categories
if $s_{S,P} > 1$. Incomplete multinomial data is data of the form $z_{S,P}$ for $S$
containing more than one element; i.e., $s_{S,P} > 1$.

Thus, for the example given in the third preceding paragraph, we have
that $z_{S,P_1} = (z_{\{1\},\{\{1\},\{2\},\{3\}\}},\ z_{\{2\},\{\{1\},\{2\},\{3\}\}},\ z_{\{3\},\{\{1\},\{2\},\{3\}\}}) = (v_1,v_2,v_3)$,
$z_{S,P_2} = (z_{\{1,2\},\{\{1,2\},\{3\}\}},\ z_{\{3\},\{\{1,2\},\{3\}\}}) = (y_{12},y_3)$, and
$z_{S,P_3} = (z_{\{1,3\},\{\{1,3\},\{2\}\}},\ z_{\{2\},\{\{1,3\},\{2\}\}}) = (w_{13},w_2)$.

We note that while we are deriving the posterior distribution of
p in Chapter 2 or calculating its limit in Chapter 4, we will use the P
subscript. For all other purposes, however, we discard the P subscript
and work with only the sufficient statistics of the Hocking and Oxspring
observed data, defined by

$$ z_S = \sum_{P} z_{S,P}. \tag{1.3} $$

Thus, in our trinomial example the sufficient statistics are
$z_{\{1\}} = z_{\{1\},\{\{1\},\{2\},\{3\}\}}$,
$z_{\{2\}} = z_{\{2\},\{\{1\},\{2\},\{3\}\}} + z_{\{2\},\{\{1,3\},\{2\}\}}$,
$z_{\{3\}} = z_{\{3\},\{\{1\},\{2\},\{3\}\}} + z_{\{3\},\{\{1,2\},\{3\}\}}$,
$z_{\{1,2\}} = z_{\{1,2\},\{\{1,2\},\{3\}\}}$, and
$z_{\{1,3\}} = z_{\{1,3\},\{\{1,3\},\{2\}\}}$.

We let z denote the vector of all $z_S$. Therefore, as in our earlier
discussion, z is our vector of observed data. Similarly, $n = \sum_S z_S$ denotes
the sum of all the observed data. Finally, we define $p_S$ as the sum of
probabilities $p_i$ for i in $S$. Thus, $p_{\{3\}} = p_3$ and $p_{\{3,5,6\}} = p_3+p_5+p_6$.
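
The bookkeeping in (1.3) can be sketched in a few lines. The data layout below (one dictionary per multinomial group, keyed by tuples of category indices) and the function names are assumptions made for illustration; the counts are those of the small trinomial example worked in Section 2.2.3.

```python
from collections import Counter


def sufficient_statistics(groups):
    """Sum the group counts z_{S,P} over the groups P to obtain z_S, as in (1.3)."""
    z = Counter()
    for group in groups:
        for subset, count in group.items():
            z[frozenset(subset)] += count
    return dict(z)


def p_of(subset, p):
    """Return p_S, the sum of p_i for i in S (categories are indexed from 1)."""
    return sum(p[i - 1] for i in subset)


# Hypothetical layout: each dict is one Hocking and Oxspring group.
groups = [
    {(1,): 2, (2,): 3, (3,): 3},   # trinomial group P_1
    {(1, 2): 4, (3,): 3},          # binomial group P_2
    {(1, 3): 2, (2,): 2},          # binomial group P_3
]
z = sufficient_statistics(groups)
n = sum(z.values())
```

With these counts, the entries of `z` for {2} and {3} pool the completely specified observations from two groups each, and `n` is the total number of observations.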

In summary, we use set notation because it is the least cumbersome
mechanism for writing sums and products over all sets (or collections)
containing a particular integer. The use of set notation also aids
derivations of exact posterior central moments in Chapter 2 and calculation
of limits in Chapter 4. On the other hand, where possible we
delete the braces and commas to simplify equations and to parallel
complete-data notation (i.e., complete-data notation is $x_i$, not $x_{\{i\}}$).
For example, we usually write $p_{12}$ instead of $p_{\{1,2\}}$. We also mix
simplified and full notations. For example, we usually write
$z_i + \sum_{D \ni i} z_D\, p_i/p_D$ rather than $z_{\{i\}} + \sum_{D \ni i} z_D\, p_{\{i\}}/p_D$, where $\sum_{D \ni i}$ means the sum
over all those multiple-integer sets that contain i. Thus, in the
trinomial example, $\sum_{D \ni 1}$ means the sum over all sets {1,2} and {1,3} that
contain the integer 1. Note that we define D as a set containing more
than one integer unless otherwise specified. That is, D cannot denote
the set {i} for any i.

Finally, we assume that the incompleteness of the data is random. 
That is [see Rubin (1976)], incomplete data is not a function of the 
values that would have been observed. 

In this thesis we are interested in minimizing risk. Risk is defined
as expected loss with respect to, in this work, the distribution of
z given p; that is, for some estimator $\tilde{p}$ of p,

$$ r(\tilde{p},p) = E[L(\tilde{p},p)] = \sum_{Z_k} L(\tilde{p},p)\, h(z \mid p), \tag{1.4} $$

where $r(\tilde{p},p)$ is the risk of $\tilde{p}$, $L(\tilde{p},p)$ is the loss function for $\tilde{p}$,
$Z_k$ = {$(z_1,\ldots,z_{k+1},z_{12},z_{13},\ldots,z_{12\cdots k})$ : each z component is a nonnegative
integer and the z components sum to n}, and h(z|p) is the density of z
given p.

In (1.4), the risk function depends on the value of the generally
unknown probability p. As Zellner (1971, p25) points out, it is
impossible to find an estimator $\tilde{p}$ that minimizes risk $r(\tilde{p},p)$ for all
possible values of p. He gives as an example that the estimator $\tilde{p}=b$, a
vector of constants, will have minimum risk when p=b; hence, as p varies
over $P_k$, the minimizing estimator varies.

Therefore, a common practice is to choose as an estimator the one
that minimizes the average risk $E[r(\tilde{p},p)]$, where

$$
\begin{aligned}
E[r(\tilde{p},p)] &= \int_{P_k} r(\tilde{p},p)\, g(p)\, dp \\
&= \int_{P_k} \Big[ \sum_{Z_k} L(\tilde{p},p)\, h(z \mid p) \Big] g(p)\, dp \\
&= \sum_{Z_k} \Big[ \int_{P_k} L(\tilde{p},p)\, f(p \mid z)\, dp \Big] q(z)
\end{aligned}
\tag{1.5}
$$

for g(p) the prior density of p, f(p|z) the posterior density of p given
z, and q(z) the marginal density of z.

Now, the estimator minimizing the term in brackets in the last line
of (1.5) also minimizes expected risk. For quadratic loss

$$ L(\tilde{p},p) = (\tilde{p}-p)'(\tilde{p}-p) = \sum_{i=1}^{k+1} (\tilde{p}_i - p_i)^2, \tag{1.6} $$

this Bayes estimator is the posterior mean. We use quadratic loss (also
called mean squared error) for the loss function because of its mathematical
tractability, frequent past usage, accuracy in approximating other loss
functions [see Mood and Graybill (1963, p165) and DeGroot (1970, p227)], and
physical interpretation. The emphasis in quadratic loss is on minimization
of the overall scatter of the estimates from the true value rather than
concentration on a few extreme departures. In particular, the quadratic-loss
criterion allows bias in an estimator if the variance is compensatingly
small.

As noted just before (1.5), however, the posterior mean will not 
minimize risk in (1.4) for all values of p. Hence, there might be ranges 
of p for which other commonly used, and easily calculated, estimators
improve on the posterior mean. Further, as Zellner (1971, p26) notes,
many sampling theorists object to use of the prior density g(p) (because 
it is never known in practice). Thus, they do not consider the minimal 
average risk property of the posterior mean to be important. 

Therefore, besides the posterior mean, we also investigate two
other estimators to minimize risk for at least some values of p. The
first estimator is the maximum likelihood estimate. We include it
because it is a classical estimator that is often used. In particular,
it is frequently used when one has no prior knowledge. For complete
data, the maximum likelihood estimator x/n is the unique minimum
variance unbiased estimate of p. Hence, any estimator having smaller
risk than x/n must be biased. However, Johnson (1971) has shown that x/n
is admissible. That is, there does not exist any other estimator
having at least as small a risk for all values of p and strictly smaller
risk for at least one value of p.

The complete-data maximum likelihood estimate x/n is admissible because
no other estimators have smaller risk when all but one of the p components
are near zero. Since the risk of x/n equals $(1 - \sum_{i=1}^{k+1} p_i^2)/n$, the risk is
close to zero when p is near a corner of the simplex. Hence, if the
incomplete-data case parallels the complete-data case, we would expect
the maximum likelihood estimate to have smallest risk when all but one
of the p components are near zero and the posterior mean to have smallest
risk furthest from the boundary; i.e., at the center of $P_k$.
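
For complete data, the quadratic-loss risk of x/n under (1.6) is $(1 - \sum_i p_i^2)/n$, which follows from $\mathrm{var}(x_i/n) = p_i(1-p_i)/n$. The sketch below checks this numerically for the trinomial case by summing the loss over every possible sample, as in the definition of risk (1.4). The function names are hypothetical, and the enumeration is only an illustration; it is not the simulation design used in this thesis.

```python
from math import comb


def mle_risk_exact(n, p):
    """Exact quadratic-loss risk of x/n for complete trinomial data, by enumeration."""
    p1, p2, p3 = p
    risk = 0.0
    for x1 in range(n + 1):
        for x2 in range(n - x1 + 1):
            x3 = n - x1 - x2
            # Trinomial probability of the sample (x1, x2, x3).
            prob = comb(n, x1) * comb(n - x1, x2) * p1**x1 * p2**x2 * p3**x3
            loss = (x1 / n - p1) ** 2 + (x2 / n - p2) ** 2 + (x3 / n - p3) ** 2
            risk += loss * prob
    return risk


def mle_risk_formula(n, p):
    """Closed form (1 - sum p_i^2) / n for the same risk."""
    return (1.0 - sum(p_i * p_i for p_i in p)) / n


center = (1 / 3, 1 / 3, 1 / 3)
near_corner = (0.05, 0.05, 0.90)
corner = (0.0, 0.0, 1.0)
```

Evaluating both functions at `center`, `near_corner`, and `corner` shows the risk shrinking toward zero as p moves to a corner of the simplex, which is the behavior the paragraph above relies on.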

We also include the posterior mode $\hat{p}$. It is an in-between estimator
in that, like the maximum likelihood estimate, it is a mode and, like the
posterior mean, it is a Bayesian estimate and utilizes prior knowledge.
Unlike the posterior mean, however, the posterior mode can have zero 
components for a nonzero prior. Hence, it is a strong competitor for 
the maximum likelihood estimate for extreme values of p, those values 
near a boundary of the simplex. 

Finally, we note that the posterior mean minimizing expected risk 
depends on knowledge of the prior g(p). In practice, we would not know 
the true prior g(p). At best we would have some estimate of g(p) that 
has, in general, undeterminable error. To investigate how robust our 
results are to use of the correct prior, we compare the three estimators 
by using two wrong priors, as well as the correct prior, in their calcu- 
lations in the small-sample trinomial simulations. Note that the 
maximum likelihood estimate, not being a Bayesian estimate, is the same 
for all three studies. 

For the first wrong prior, we choose the uniform prior with vector 
of parameters (1,1,1) because of its common use when one is uncertain of 
prior knowledge. The uniform prior gives equal weight to all components 
of p. For this prior, the posterior mode equals the maximum likelihood 
estimate. For the second wrong prior, we choose the vector of parameters 
$10 \times [\nu/10 + (.09, .05, -.14)]$, where $\nu$ is the vector of correct prior parameters. This prior
perturbs the three components of p by .09, .05, and -.14, respectively. 
Hence, we call it the perturbed prior. 
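
The perturbed-prior parameters are a one-line computation. The sketch below uses a hypothetical correct prior (5, 3, 2); the assumption that the correct parameters sum to 10, so that dividing by 10 gives the prior mean, is ours and is suggested only by the form of the expression above.

```python
def perturbed_prior(nu, shift=(0.09, 0.05, -0.14), scale=10.0):
    """Return scale * (nu / scale + shift), the perturbed prior parameters."""
    return [scale * (v / scale + s) for v, s in zip(nu, shift)]


nu_correct = [5.0, 3.0, 2.0]             # hypothetical correct prior, sum = 10
nu_perturbed = perturbed_prior(nu_correct)
```

Because the three shifts sum to zero, the perturbed parameters keep the same total as the correct ones; only the relative weight given to each category changes.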
1.3 Literature Review:

To date, most of the published work on estimation from incomplete 
multinomial data has concerned maximum likelihood estimation. In 1958 
Hartley presented an iterative method for calculating maximum likelihood 
estimates from those sets of discrete data for which a maximum-likelihood 
procedure is available for the corresponding complete-data sample. Because 
his method was later generalized and clarified by Dempster, Laird, and 
Rubin (1977) in a paper described at the end of this section, we do not 
further discuss Hartley's method now. Hartley gave examples for the 
Poisson, negative binomial, and binomial distributions. Hartley also pro- 
posed calculating the large-sample covariance matrix of the maximum 
likelihood estimates by using the calculus of finite differences. He used 
the iterates from the maximum-likelihood-estimate algorithm to estimate 
the second derivative of the log likelihood function via the standard 
finite difference formula. 

Blumenthal (1968) considered maximum-likelihood estimation from
incomplete multinomial data for the special case in which a category does
not share data with more than one group of categories. That is, for the
k-dimensional multinomial population, if category $C_i$ shares data with
category $C_j$ for j in some subset $\Omega$ of the k+1 indices of p, then $C_i$ does
not share data with any category $C_h$ for which h is not an element of $\Omega$.
For the binomial case, Blumenthal also investigated the problem of nonrandom
missingness.

Hocking and Oxspring (1971) considered the case in which data comes 
from populations all related to the same "parent" population. In a related
population, at least one parameter is the sum of two or more probabilities
from the parent population. Those parameters for the related population
that are not such sums exhaust those probabilities of the parent population
that are not elements of these sums. Hocking and Oxspring derived
the maximum likelihood estimates and their large-sample covariance matrix 
in the usual manner (e.g., the large-sample inverse covariance matrix is 
the Fisher Information for p). They developed an iterative algorithm for 
solution of the resulting nonlinear equations. 

A simple case of the Hocking and Oxspring situation is that of a
parent population having probabilities $p_1$, $p_2$, and $p_3$ and a related
population having probabilities $p_1+p_2$ and $p_3$. In general, however, we do
not have sample information given twice on category $C_3$. That is, we
have sample data given for $p_1$, $p_2$, $p_3$, and $p_1+p_2$ and do not have data on
$C_3$ broken into two groups to help estimation.

Sundberg (1974) developed maximum-likelihood theory for the general 
problem of incomplete data from an exponential family, of which the multi- 
nomial distribution is a member. He proved that the derivatives of the 
log likelihood with respect to the natural (exponential) parameters can 
be written as the difference of an unconditional and conditional expecta- 
tion of the complete-data sufficient statistics. He noted that this form 
for the first and second partial derivatives was first discovered in unpublished
work by Martin-Löf. [However, Efron (1977) noted that this
form was implicit in Fisher's 1925 paper.]

Dempster, Laird, and Rubin (1977) extended Sundberg's work to the
general case where the problem need not involve an exponential family.
They called their algorithm the EM algorithm because it consists of an
expectation step followed by a maximization step. Although this is the
same algorithm proposed by Hartley (1958), Dempster, Laird, and Rubin 
generalized the algorithm, clarified the techniques, improved the mathe- 
matics, and extended the history and usage of the algorithm. They proved 
that the EM algorithm converges to a local maximum or a saddle point when 
the likelihood is bounded and the matrix of second partial derivatives of 
the complete-data likelihood is negative definite with nonzero bounded 
eigenvalues. They also gave a formula for the rate of convergence close 
to a stationary point. Finally, they showed how the EM algorithm can be 
used to calculate a posterior mode. 
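
To make the expectation and maximization steps concrete for incomplete multinomial data, here is a minimal sketch of the standard EM iteration. It is not the algorithm as presented in Chapter 2: the function name, the data layout (a list of completely classified counts plus a dictionary of partially classified counts keyed by category tuples), and the restriction to prior parameters $\nu_i \ge 1$ are all assumptions made for this illustration. The E-step fills in each category with $z_i + \sum_{D \ni i} z_D\, p_i/p_D$; the M-step renormalizes, with $\nu_i - 1$ added when a posterior mode rather than a maximum likelihood estimate is wanted.

```python
def em_mode(z_single, z_multi, nu=None, tol=1e-12, max_iter=10000):
    """EM sketch: MLE (nu omitted, i.e. all nu_i = 1) or Dirichlet posterior mode.

    z_single[i-1] is the completely classified count for category i;
    z_multi maps a tuple of category indices D to the partially classified count z_D.
    """
    k1 = len(z_single)
    nu = [1.0] * k1 if nu is None else list(nu)
    if min(nu) < 1.0:
        raise ValueError("this sketch assumes every nu_i >= 1")
    n = sum(z_single) + sum(z_multi.values())
    denom = n + sum(nu) - k1
    p = [1.0 / k1] * k1                      # start at the center of the simplex
    for _ in range(max_iter):
        # E-step: expected complete-data counts given the current p.
        filled = [float(c) for c in z_single]
        for subset, count in z_multi.items():
            p_D = sum(p[i - 1] for i in subset)
            if p_D > 0.0:
                for i in subset:
                    filled[i - 1] += count * p[i - 1] / p_D
        # M-step: complete-data mode with the filled-in counts.
        new_p = [(f + v - 1.0) / denom for f, v in zip(filled, nu)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Small trinomial data set (the one used in Section 2.2.3).
z_single = [2, 5, 6]
z_multi = {(1, 2): 4, (1, 3): 2}
p_mle = em_mode(z_single, z_multi)
```

With all $\nu_i = 1$ the added term vanishes and the iteration returns the maximum likelihood estimate, which matches the remark in Section 1.2 that the posterior mode under a uniform prior equals the maximum likelihood estimate.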

We describe the EM algorithm in the next chapter where we use it to 
calculate the mode estimators, the maximum likelihood estimate and the 
posterior mode. We also use the EM algorithm for solution of the approximation
we develop in Chapter 3 for the exact posterior mean.

CHAPTER 2
THE ESTIMATORS


2.1 Introduction:

In this chapter we give formulas for the estimators. In the next 
section we derive the posterior central moments. We begin with known 
formulas for the complete-data case and then, utilizing notation defined 
at the beginning of Section 1.2, derive elements of the posterior mean 
and covariance matrices for the incomplete-data case. We then illustrate 
these derivations with an example and discuss difficulties in the numer- 
ical computation of these exact moments. 

In the last section, we give derivations for the mode estimators 
based on theory from Sundberg (1974). We then show how values of these 
estimators are calculated with the EM algorithm of Dempster, Laird, and 
Rubin (1977). The first part of the section discusses the maximum
likelihood estimate. The second part details results for the posterior
mode.
2.2 Posterior Central Moments:
2.2.1 Complete Data:


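For the k-dimensional Dirichlet prior g(p) in (1.1) and complete-data
sample $x = (x_1,\ldots,x_{k+1})$ from the multinomial distribution with density (1.2),
the posterior distribution of p given x has the k-dimensional density

$$
\begin{aligned}
f(p \mid x) &= \frac{ \left[ \Gamma\!\left( \sum_{i=1}^{k+1} \nu_i \right) \Big/ \prod_{i=1}^{k+1} \Gamma(\nu_i) \right] \prod_{i=1}^{k+1} p_i^{\nu_i-1} \left[ n! \Big/ \prod_{i=1}^{k+1} x_i! \right] \prod_{i=1}^{k+1} p_i^{x_i} }{ \int \cdots \int \left[ \Gamma\!\left( \sum \nu_i \right) \Big/ \prod \Gamma(\nu_i) \right] \prod p_i^{\nu_i-1} \left[ n! \Big/ \prod x_i! \right] \prod p_i^{x_i}\, dp } \\
&= \left[ \Gamma\!\left( n + \sum_{i=1}^{k+1} \nu_i \right) \Big/ \prod_{i=1}^{k+1} \Gamma(x_i+\nu_i) \right] \prod_{i=1}^{k+1} p_i^{x_i+\nu_i-1}.
\end{aligned}
\tag{2.1}
$$

Thus, the posterior distribution is again k-dimensional Dirichlet, this
time with parameters $x_i+\nu_i$ for $1 \le i \le k+1$.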

As is well known, the posterior mean of $p_i$ given x is

$$ E(p_i \mid x) = (x_i+\nu_i) \Big/ \Big( n + \sum_{j=1}^{k+1} \nu_j \Big). \tag{2.2} $$

Similarly, the posterior covariance matrix has elements

$$ \mathrm{var}(p_i \mid x) = \Big( n + \sum_{h=1}^{k+1} \nu_h + 1 \Big)^{-1} E(p_i \mid x)\, [1 - E(p_i \mid x)] \tag{2.3} $$

and

$$ \mathrm{cov}(p_i,p_j \mid x) = -\Big( n + \sum_{h=1}^{k+1} \nu_h + 1 \Big)^{-1} E(p_i \mid x)\, E(p_j \mid x). \tag{2.4} $$

The vector of posterior means (2.2) is the Bayes estimator for quadratic
loss defined in (1.6).
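
The complete-data posterior moments (2.2) through (2.4) are simple enough to evaluate exactly. The sketch below does so with rational arithmetic; the function name and the example counts (2, 3, 3) with a uniform prior are illustrative assumptions.

```python
from fractions import Fraction


def dirichlet_posterior_moments(x, nu):
    """Posterior mean vector (2.2) and covariance matrix (2.3)-(2.4) for complete data."""
    a = [Fraction(x_i) + Fraction(v_i) for x_i, v_i in zip(x, nu)]
    m = sum(a)                                   # n + sum of prior parameters
    mean = [a_i / m for a_i in a]
    size = len(a)
    cov = [
        [
            (mean[i] * (1 - mean[i]) if i == j else -mean[i] * mean[j]) / (m + 1)
            for j in range(size)
        ]
        for i in range(size)
    ]
    return mean, cov


mean, cov = dirichlet_posterior_moments([2, 3, 3], [1, 1, 1])
```

Each row of the covariance matrix sums to zero, as it must because the components of p sum to one.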





In general, for $l$ a positive integer,

$$ E(p_i^l \mid x) = \prod_{q=0}^{l-1} (x_i+\nu_i+q) \Big/ \prod_{q=0}^{l-1} \Big( n + \sum_{h=1}^{k+1} \nu_h + q \Big), \tag{2.5} $$

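so that, from multinomial expansion and substitution of (2.2) and (2.5),
the $l$th moment of $p_i \mid x$ about $E(p_i \mid x)$ is

$$
\begin{aligned}
E\{[p_i - E(p_i \mid x)]^l \mid x\}
&= \sum_{j=0}^{l} (-1)^j \binom{l}{j} \left( \frac{x_i+\nu_i}{n+\sum \nu_h} \right)^{j} \prod_{q=0}^{l-j-1} \frac{x_i+\nu_i+q}{n+\sum \nu_h+q} \\
&= \sum_{j=0}^{l} (-1)^j \binom{l}{j} \left( \frac{x_i/n+\nu_i/n}{1+\sum \nu_h/n} \right)^{j} \prod_{q=0}^{l-j-1} \frac{x_i/n+(\nu_i+q)/n}{1+(\sum \nu_h+q)/n},
\end{aligned}
\tag{2.6}
$$

where we use the convention that $\prod_{q=0}^{-1} f(q) = 1$ for any function f of q.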
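
Formulas (2.5) and (2.6) can be checked against each other with exact arithmetic. In this sketch, `a_i` stands for $x_i+\nu_i$ and `m` for $n+\sum \nu_h$; both names, and the restriction to integer values, are assumptions made so that `Fraction` can be used directly.

```python
from fractions import Fraction
from math import comb


def raw_moment(a_i, m, l):
    """E(p_i^l | x) as in (2.5); an empty product (l = 0) equals 1."""
    out = Fraction(1)
    for q in range(l):
        out *= Fraction(a_i + q, m + q)
    return out


def central_moment(a_i, m, l):
    """The l-th posterior moment of p_i about its mean, as in the first line of (2.6)."""
    mu = Fraction(a_i, m)
    return sum(
        (-1) ** j * comb(l, j) * mu ** j * raw_moment(a_i, m, l - j)
        for j in range(l + 1)
    )


second = central_moment(3, 11, 2)   # x_i + nu_i = 3, n + sum(nu) = 11
```

For $l = 2$ the result agrees with the variance formula (2.3), here $(3/11)(8/11)/12 = 2/121$.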


2.2.2 Incomplete Data:

Recall the notation defined at the beginning of Section 1.2. Let p
again have the Dirichlet prior density g(p) of (1.1). Further, assume that
given p, and thus all $p_S$, each vector $z_{S,P}$ has the multinomial distribution

$$ h_P(z_{S,P} \mid p) = \left[ \Big( \sum_{S \in P} z_{S,P} \Big)! \Big/ \prod_{S \in P} z_{S,P}! \right] \prod_{S \in P} p_S^{z_{S,P}}. \tag{2.7} $$

Then, the likelihood of the total incomplete data z given p is

$$ h(z \mid p) = \prod_{P} h_P(z_{S,P} \mid p). \tag{2.8} $$

The posterior density of p given z is therefore

$$ f(p \mid z) = g(p)\, h(z \mid p) \Big/ \int_{P_k} g(p)\, h(z \mid p)\, dp. \tag{2.9} $$
To evaluate f(p|z), recall that $p_{\{i\}}$ is $p_i$ and that $p_S$ is a sum
of probabilities; i.e., $p_S = \sum_{i \in S} p_i$. Thus, we can rewrite $p_S^{z_S}$ as a multinomial
expansion. For example, if $p_S = p_3+p_5+p_6$, then we can write $p_S^{z_S}$ as

$$ (p_3+p_5+p_6)^{z_S} = \sum_{a=0}^{z_S} \sum_{b=0}^{z_S-a} \binom{z_S}{a} \binom{z_S-a}{b}\, p_3^{a}\, p_5^{b}\, p_6^{z_S-a-b}. \tag{2.10} $$


Rewriting the posterior density (2.9) in this manner, multiplying
resulting terms times each other and the prior, and collecting terms
yields the numerator as a sum of $\omega$ terms of the form

$$ c_l\, p_1^{\gamma_{1l}-1}\, p_2^{\gamma_{2l}-1} \cdots p_{k+1}^{\gamma_{(k+1)l}-1}, \tag{2.11} $$

where $1 \le l \le \omega$, $\omega = \prod_D (z_D+1)$ for D containing more than one integer, $c_l$ is a
function of the incomplete data only (hence, not a function of p), and
$\sum_{j=1}^{k+1} \gamma_{jl} = n + \sum_{i=1}^{k+1} \nu_i = m$. That is, $\sum_{j=1}^{k+1} \gamma_{jl} = m$ is the sum of the prior parameters
$\nu_i$ plus the total number of observations and thus is independent of $l$.
[See following Section 2.2.3 for an example.]

Hence, each term (2.11) of the numerator can be written as a Dirichlet
density times a coefficient that is not a function of p. Therefore,
integrating the numerator with respect to p to evaluate the denominator
yields that the posterior density of p given z is

$$ f(p \mid z) = \sum_{l=1}^{\omega} c_l \prod_{j=1}^{k+1} p_j^{\gamma_{jl}-1} \Big/ \Big\{ \sum_{l=1}^{\omega} \Big[ c_l \prod_{j=1}^{k+1} \Gamma(\gamma_{jl}) \big/ \Gamma(m) \Big] \Big\}. \tag{2.12} $$

Let $B = \sum_{l=1}^{\omega} c_l \prod_{j=1}^{k+1} \Gamma(\gamma_{jl})$. Then the posterior mean of $p_i$ given z is





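$$ E(p_i \mid z) = m^{-1} \sum_{l=1}^{\omega} c_l\, \Gamma(\gamma_{il}+1) \prod_{j \ne i} \Gamma(\gamma_{jl}) \Big/ B. \tag{2.13} $$

Similarly,

$$ E(p_i^2 \mid z) = [m(m+1)]^{-1} \sum_{l=1}^{\omega} c_l\, \Gamma(\gamma_{il}+2) \prod_{j \ne i} \Gamma(\gamma_{jl}) \Big/ B \tag{2.14} $$

and

$$ E(p_i p_h \mid z) = [m(m+1)]^{-1} \sum_{l=1}^{\omega} c_l\, \Gamma(\gamma_{il}+1)\, \Gamma(\gamma_{hl}+1) \prod_{j \ne i,h} \Gamma(\gamma_{jl}) \Big/ B, \tag{2.15} $$

for variance and covariance calculations

$$ \mathrm{var}(p_i \mid z) = E(p_i^2 \mid z) - [E(p_i \mid z)]^2 \tag{2.16} $$

and

$$ \mathrm{cov}(p_i,p_h \mid z) = E(p_i p_h \mid z) - E(p_i \mid z)\, E(p_h \mid z), \tag{2.17} $$

respectively.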
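
The expansion behind (2.11) through (2.13) can be carried out mechanically for small data sets. The sketch below is one possible implementation, restricted to integer prior parameters so that every gamma function is a factorial and the arithmetic stays exact. The function names and data layout are assumptions made for this illustration. The constant multinomial coefficient of the likelihood is omitted because it cancels between numerator and denominator.

```python
from fractions import Fraction
from math import factorial, prod


def expand_numerator(z_single, z_multi, nu):
    """Return {exponents: c_l} for prod_i p_i^(z_i+nu_i-1) * prod_D p_D^(z_D).

    Each key holds the exponents gamma_jl - 1 of one term of the form (2.11).
    """
    terms = {tuple(z + v - 1 for z, v in zip(z_single, nu)): 1}
    for subset, count in z_multi.items():
        for _ in range(count):               # multiply once by (sum of p_i, i in D)
            new_terms = {}
            for expo, c in terms.items():
                for i in subset:
                    e = list(expo)
                    e[i - 1] += 1
                    key = tuple(e)
                    new_terms[key] = new_terms.get(key, 0) + c
            terms = new_terms
    return terms


def exact_posterior_mean(z_single, z_multi, nu):
    """Exact posterior mean (2.13), assuming integer prior parameters nu_i >= 1."""
    terms = expand_numerator(z_single, z_multi, nu)
    m = sum(z_single) + sum(z_multi.values()) + sum(nu)
    # c_l * prod_j Gamma(gamma_jl), using Gamma(e + 1) = e! for integer e.
    weights = {e: c * prod(factorial(g) for g in e) for e, c in terms.items()}
    B = sum(weights.values())
    # Gamma(gamma_il + 1) = gamma_il * Gamma(gamma_il), with gamma_il = e[i] + 1.
    return [
        Fraction(sum(w * (e[i] + 1) for e, w in weights.items()), m * B)
        for i in range(len(z_single))
    ]


terms = expand_numerator([2, 5, 6], {(1, 2): 4, (1, 3): 2}, [1, 1, 1])
post_mean = exact_posterior_mean([2, 5, 6], {(1, 2): 4, (1, 3): 2}, [1, 1, 1])
```

Even for this 19-observation data set the numerator already has 15 terms, and the count grows multiplicatively with each additional pattern of incomplete data, which is the computational difficulty the text goes on to illustrate.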


2.2.3 Example:

We now give an example for a small artificial data set to illus- 
trate derivations given in Section 2.2.2. We also want to indicate 
difficulties that would be encountered in numerically evaluating these 
elements of the exact posterior mean and covariance matrices for 
larger or more complex data sets unless one has unusual computing 
equipment. 

We created the data in the more restrictive form of Hocking and 
Oxspring to show how their form relates to ours. Suppose that we have 
observed the following data on three categories i, $1 \le i \le 3$:

      Sample   Categories distinguished   Counts     Total
        1      {1}, {2}, {3}              2, 3, 3      8
        2      {1,2}, {3}                 4, 3         7        (2.18)
        3      {1,3}, {2}                 2, 2         4
                                                      19 = n

where each braced pair, such as {1,2}, denotes the two categories between which the incompletely
specified observations fall. The amount of incomplete data is 32% of 
the total sample size. In the notation of Section 1.2, we have that, 



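$\mathcal{P}_1 = \{\{1\},\{2\},\{3\}\}$, $\mathcal{D}_{1,1}=\{1\}$, $\mathcal{D}_{2,1}=\{2\}$, $\mathcal{D}_{3,1}=\{3\}$;

$\mathcal{P}_2 = \{\{1,2\},\{3\}\}$, $\mathcal{D}_{1,2}=\{1,2\}$, $\mathcal{D}_{2,2}=\{3\}$;

$\mathcal{P}_3 = \{\{1,3\},\{2\}\}$, $\mathcal{D}_{1,3}=\{1,3\}$, $\mathcal{D}_{2,3}=\{2\}$;

$z_{\{1\}}=2$, $z_{\{2\}}=3+2=5$, $z_{\{3\}}=3+3=6$, $z_{\{1,2\}}=4$, $z_{\{1,3\}}=2$, $z_{\{2,3\}}=0$;

$z=(2,5,6,4,2,0)$, and $n = \sum_{\mathcal{D}} z_{\mathcal{D}} = 2+5+6+4+2 = 19$.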

From (2.7) and (2.8), the likelihood of z given p is 


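$$ \begin{aligned}
l(p;z) &= \frac{(2+3+3)!}{2!\,3!\,3!} \, p_1^2 p_2^3 p_3^3 \cdot \frac{(4+3)!}{4!\,3!} \, p_{\{1,2\}}^4 \, p_3^3 \cdot \frac{(2+2)!}{2!\,2!} \, p_{\{1,3\}}^2 \, p_2^2 \\
&= \frac{8!\,7!\,4!}{2!\,3!\,3!\,4!\,3!\,2!\,2!} \, p_1^2 p_2^5 p_3^6 \, (p_1+p_2)^4 (p_1+p_3)^2.
\end{aligned} \tag{2.19} $$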


Suppose that we have a uniform prior $g(p_1,p_2)=2$; that is, $\nu_i=1$ for
$1 \le i \le 3$ in the Dirichlet prior (1.1). Then, the posterior density of p
given the incomplete data z is

$$ f(p|z) = \frac{p_1^2 p_2^5 p_3^6 \, (p_1+p_2)^4 (p_1+p_3)^2}{\int\!\!\int_{P_2} p_1^2 p_2^5 p_3^6 \, (p_1+p_2)^4 (p_1+p_3)^2 \, dp_1 \, dp_2}. \tag{2.20} $$


Through expansion, multiplication, and collection of terms, the numerator 
of the posterior density (2.20) can be written as 


$$ \begin{aligned}
p_1^2 p_2^5 p_3^6 & \Big[ \sum_{a=0}^{4} \binom{4}{a} p_1^a p_2^{4-a} \Big] \Big[ \sum_{b=0}^{2} \binom{2}{b} p_1^b p_3^{2-b} \Big] \\
&= p_1^2 p_2^9 p_3^8 + 4 p_1^3 p_2^8 p_3^8 + 6 p_1^4 p_2^7 p_3^8 + 4 p_1^5 p_2^6 p_3^8 + p_1^6 p_2^5 p_3^8 \\
&\quad + 2 ( p_1^3 p_2^9 p_3^7 + 4 p_1^4 p_2^8 p_3^7 + 6 p_1^5 p_2^7 p_3^7 + 4 p_1^6 p_2^6 p_3^7 + p_1^7 p_2^5 p_3^7 ) \\
&\quad + p_1^4 p_2^9 p_3^6 + 4 p_1^5 p_2^8 p_3^6 + 6 p_1^6 p_2^7 p_3^6 + 4 p_1^7 p_2^6 p_3^6 + p_1^8 p_2^5 p_3^6.
\end{aligned} \tag{2.21} $$


Adding $\nu_i - 1 = 1 - 1 = 0$ to each exponent in (2.21), we have that the
numerator is a sum of $\omega = \prod_D (z_D+1) = 5 \times 3 = 15$ terms of the form

$$ c_l \, p_1^{\gamma_{1l}-1} \, p_2^{\gamma_{2l}-1} \, p_3^{\gamma_{3l}-1} $$

with $\sum_{i=1}^{3} \gamma_{il} = n + \sum_{i=1}^{3} \nu_i = 19+3 = 22$ for all $1 \le l \le 15$.
Integrating the numerator (2.21) with respect to p to evaluate the
denominator yields the posterior density (2.12) of p given z.

The smaller the variance of a distribution, the better a point 

estimate, such as the mean, is as a descriptor of the distribution. 
Therefore, as a rough indication of how large the variance is, we define 
a sample coefficient of variation 


$$ C.V.(p_i|z) = [var(p_i|z)]^{1/2} / E(p_i|z). \tag{2.22} $$

[Note that the coefficient of variation is usually defined as a standard
deviation of an estimator (not a distribution) divided by the
estimator.]

Calculating the mean (2.13), variance (2.16), covariance (2.17), 
and sample coefficient of variation (2.22) yields results shown in the 
following Table 2.1. 


TABLE 2.1

EXAMPLE 2.2.3 RESULTS

    Moment             Category i
                     1           2           3
    E(p_i|z)      0.241202    0.384927    0.373871
    var(p_i|z)    0.011921    0.012725    0.011203
    C.V.(p_i|z)   0.4527      0.2931      0.2831

    cov(p_1,p_2|z) = -0.006721
    cov(p_1,p_3|z) = -0.005199
    cov(p_2,p_3|z) = -0.006004



As expected, the sample coefficient of variation is highest for $p_1$
because category 1 has the highest proportion of shared data. [Compare
$(z_{12}+z_{13})/(z_1+z_{12}+z_{13}) = .75$ with $z_{12}/(z_2+z_{12}) = .44$ and
$z_{13}/(z_3+z_{13}) = .25$.] The posterior variance of $p_1$ is larger, in
proportion to the posterior mean of $p_1$, than is that of $p_2$ or $p_3$ to
their respective posterior means.
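
The calculation behind Table 2.1 can be sketched in a few lines. The following is a minimal illustration, not the program used for the dissertation: it expands the numerator (2.21) into its 15 terms and evaluates (2.13) through (2.17) with exact rational arithmetic, which sidesteps rounding entirely for a data set this small. The function and variable names are hypothetical, and the code handles only the pattern of this example (partially classified counts for {1,2} and {1,3}).

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial

# Data of Example 2.2.3 with the uniform prior (nu_i = 1).
z = [2, 5, 6]            # completely classified counts z_1, z_2, z_3
nu = [1, 1, 1]           # Dirichlet prior parameters
z12, z13 = 4, 2          # partially classified counts for {1,2} and {1,3}
m = sum(z) + z12 + z13 + sum(nu)   # m = n + sum of prior parameters = 22


def expanded_terms():
    """Yield (c_l, gamma_l) for each term of the expanded numerator (2.21)."""
    for a, b in product(range(z12 + 1), range(z13 + 1)):
        c = comb(z12, a) * comb(z13, b)
        gamma = (z[0] + nu[0] + a + b,      # exponent of p_1, plus one
                 z[1] + nu[1] + z12 - a,    # exponent of p_2, plus one
                 z[2] + nu[2] + z13 - b)    # exponent of p_3, plus one
        yield c, gamma


def weighted_sum(shift):
    """Sum over l of c_l * prod_j Gamma(gamma_jl + shift_j), as an exact integer."""
    total = 0
    for c, gamma in expanded_terms():
        term = c
        for g, s in zip(gamma, shift):
            term *= factorial(g + s - 1)    # Gamma(x) = (x - 1)! for integer x
        total += term
    return total


B = weighted_sum((0, 0, 0))


def moment(shift):
    """E(prod_j p_j^shift_j | z) for shifts summing to 1 or 2; see (2.13)-(2.15)."""
    scale = m if sum(shift) == 1 else m * (m + 1)
    return Fraction(weighted_sum(shift), scale * B)


means = [moment(tuple(int(j == i) for j in range(3))) for i in range(3)]
variances = [moment(tuple(2 * int(j == i) for j in range(3))) - means[i] ** 2
             for i in range(3)]
cov12 = moment((1, 1, 0)) - means[0] * means[1]

for i in range(3):
    cv = float(variances[i]) ** 0.5 / float(means[i])
    print(i + 1, float(means[i]), float(variances[i]), cv)
```

Exact integers are affordable here only because there are 15 terms; the evaluation problems of the next section arise when the same sums run to thousands of terms.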

2.2.4 Evaluation Problems:

In general, we have the following problems in evaluating the exact
posterior central moments:

(1) large number of terms - hence, pocket calculators and many desk
calculators cannot be used;

(2) rounding errors (in the large number of terms to sum, the products
of gamma functions and factorial-like constants $c_l$, approximations
for the gamma functions, and final divisions, sums, and subtrac-
tions) - hence, computers must carry many figures of precision;
and

(3) large magnitude of terms (each term is a product of generally large
gamma functions and factorial-like constants $c_l$) - hence, computers
must have an unusually large range of values unless much extra
computer programing and execution cost, time, and storage are used.

In the next few paragraphs, we discuss these problems and give several 
illustrative examples. An example of an unusual electronic computer 
that can be straightforwardly used to calculate these moments in small 
enough samples is discussed in Sections 5.4 and 5.10. 

The example given in the last section is among the smallest data
sets one could have. Yet, even for it there are 15 terms in each of the
numerators for $E(p_1|z)$, $E(p_2|z)$, $E(p_1^2|z)$, $E(p_2^2|z)$, $E(p_3^2|z)$,
$E(p_1p_2|z)$, $E(p_1p_3|z)$, and $E(p_2p_3|z)$. The denominator, the same for all
calculations, also had 15 terms. Hence, there were 135 terms plus all the
multiplications within terms, additions, divisions, and subtractions to
evaluate the final moments. For a trinomial sample having incompletely
specified observations $z_{12}=z_{13}=z_{23}=9$, the number of terms in each
numerator (and the one denominator) is 1000. Hence, there are a total of
9,000 terms to evaluate, not including any multiplication within terms,
addition of the 1000 terms, and subtractions and divisions for the final
moments.

Finally, for a trinomial sample having incompletely specified obser-
vations $z_{12}=16$, $z_{13}=17$, and $z_{23}=17$ (corresponding to 50% incomplete data
in a sample size of 100 and 15% incomplete data in a sample size of 330),
there would be 5,202 terms in each numerator (and the one denominator)
for evaluating the posterior mean and covariance matrix. Thus, the total
number of terms, excluding the multiplications within terms, addition of
the 5,202 terms, etc. would be 46,818.

To evaluate these moments even on a large electronic computer can be 

difficult. Because of the gamma function in the terms, we need a 

computer having an unusually large range. In the second example, a term 
( 9 \ 19 \( 9 \ r* / o r \ -n/yinX — i n 134 IIAII i j ^ ^ J 


of r(35) r(40) r(32) = 10 AO would exceed the range of most 

electronic computers. Most have ranges smaller than $10^{-100}$ to $10^{100}$. Yet,
depending on the prior, this is a term for a sample size of only 100, and 
this is only one of 1,000 terms. We can circumvent the range problem by 
dividing each term of the numerator and denominator by a large value; 
hence, scaling down the terms. However, doing so takes more computer 
programing and execution time, cost, and storage. Further, it also 
creates problems with roundoff error. We might also have to scale down 
more than once, depending on the values involved. Each successive such 
scaling involves increasing cost and roundoff error. 
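
One way to picture the scaling idea is to carry every term as a logarithm and subtract the largest logarithm before summing, so that no intermediate value leaves the floating-point range. The sketch below is an illustration only and is not a method described in the text: it relies on Python's `math.lgamma`, the helper names are hypothetical, and it handles only a trinomial with partially classified counts for {1,2} and {1,3}.

```python
from math import comb, exp, lgamma, log


def log_sum_exp(values):
    """log(sum(exp(v))) computed after subtracting the largest v (the scaling step)."""
    top = max(values)
    return top + log(sum(exp(v - top) for v in values))


def log_terms(z, nu, z12, z13, shift=(0, 0, 0)):
    """Logs of c_l * prod_j Gamma(gamma_jl + shift_j) over all expanded terms."""
    out = []
    for a in range(z12 + 1):
        for b in range(z13 + 1):
            gamma = (z[0] + nu[0] + a + b,
                     z[1] + nu[1] + z12 - a,
                     z[2] + nu[2] + z13 - b)
            log_c = log(comb(z12, a)) + log(comb(z13, b))
            out.append(log_c + sum(lgamma(g + s) for g, s in zip(gamma, shift)))
    return out


def posterior_mean_p1(z, nu, z12, z13):
    """E(p_1 | z) from (2.13), with numerator and B both kept on the log scale."""
    m = sum(z) + z12 + z13 + sum(nu)
    log_num = log_sum_exp(log_terms(z, nu, z12, z13, shift=(1, 0, 0)))
    log_den = log_sum_exp(log_terms(z, nu, z12, z13))
    return exp(log_num - log_den) / m


print(posterior_mean_p1([2, 5, 6], [1, 1, 1], 4, 2))          # small example
print(posterior_mean_p1([200, 300, 300], [1, 1, 1], 100, 100))  # gammas far beyond 10^100
```

Working on the log scale removes the range problem but not the roundoff accumulated over many terms, so the precision concerns discussed next still apply.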

The cost and time involved in evaluating these moments is important. 
The loss in precision, however, is critical. For the third example, a 
computer carrying even eight significant-figure accuracy will yield an 
answer for the exact solution that can be counted on for only one or two 
significant figures. [The large loss in precision is due to rounding errors
in approximations for the gamma functions, in the several multiplications
within each term, in the additions of the 5,202 terms (roundoff error is
approximately $\sqrt{5202}$, or 2 to 3 significant figures), in the final
divisions, additions, and subtractions, along with roundoff error from any
divisions necessary to scale down the magnitude of terms to fit within the
range of the computer.]

If a computer carries six significant-figure accuracy, which is 
common, one might not get any accurate evaluation. Hence, any canned 
computer program would be particularly susceptible to wrong usage and 
interpretation. Someone not understanding the numerical problems or 
heeding any package warnings might use it on a six significant-figure 
single-precision accurate computer and think his answers were correct. 

On many large electronic computers, one can use double-precision 
significant-figure calculations. However, doing so would usually at 
least quadruple the cost. Further, on those large electronic computers, 
as well as those numerous kinds of desk and pocket calculators, not 
allowing double-precision calculations, or enough single-precision 
accuracy, there is no way to obtain an accurate evaluation of the exact 
posterior mean and covariance elements. 

One driving factor in these problems is the large magnitude of the
terms. The other driving factor is the number $\omega = \prod_D (z_D+1)$ of these
terms in each numerator of $E(p_i p_h|z)$. As either sample size or percentage
of incomplete data increases, $\omega$ increases. For a sample size of 200 and
percentage of incomplete data of 50% with $z_{12}=z_{13}=33$ and $z_{23}=34$, the
number of terms in each numerator for the moments is 40,460. Hence, the
total number of terms for just the numerators (excluding multiplications
and gamma approximations involved in each term) is 364,140.

However, if we consider the same sample size and percentage of incomplete 
data for a 5-nomial having, say z^lS, z 23 = 34, z 3 g=3, z i 23 = ^» 

and z i 234 =17 » the number of terms in each numerator is 645,120 and the 
total number of terms for just the numerators (with same preceding 
exclusions) is 5,806,080. Hence, the problems illustrated for the
trinomial data samples, as well as the cost, increase in a somewhat
factorial manner as the number of multinomial dimensions increases.
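
The term counts quoted for the trinomial examples follow directly from $\omega = \prod_D (z_D+1)$. The short sketch below reproduces three of them; the function name is hypothetical, and the factor of nine (eight numerators plus the common denominator) is taken from the count given for the small example.

```python
from math import prod


def n_terms(partial_counts):
    """omega = prod_D (z_D + 1) over the partially classified counts z_D."""
    return prod(z_d + 1 for z_d in partial_counts)


SUMS_PER_TRINOMIAL = 9   # 8 numerators plus the one denominator, as counted above

examples = {
    "Example 2.2.3 (z12=4, z13=2, z23=0)": (4, 2, 0),
    "z12=z13=z23=9": (9, 9, 9),
    "z12=z13=33, z23=34": (33, 33, 34),
}

for label, counts in examples.items():
    omega = n_terms(counts)
    print(label, omega, SUMS_PER_TRINOMIAL * omega)
```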

Finally, it would be nice to have a short, easily remembered and 
easily evaluated, formula for at least the posterior mean. As Hoaglin 
(1977) notes, such a formula is valuable. It can be evaluated by pocket 
calculators anywhere. The maximum likelihood estimate and posterior 
mode, to be given in the next section, both have short, easily remembered 
formulas. Although these formulas can often be evaluated by pocket calcu- 
lator, they are not simple to evaluate in general. However, they are 
very easy and inexpensive to program for computer evaluation. In parti- 
cular, they do not have the three computational problems just outlined 
for the exact posterior mean. We find in Chapter 3 that we can derive a 
similar, although approximate, formula for the posterior mean. 

2.3 Mode Estimators:
2.3.1 Background:


In this section we show how the maximum likelihood estimate and pos-
terior mode are derived. First consider the complete-data equivalent x
of z. Let $z_D^{(i)}$ denote the (unknown) number of the $z_D$ observations that
fall in category $C_i$. Then, for $1 \le i \le k+1$,

$$ x_i = z_i + \sum_{D \ni i} z_D^{(i)}. \tag{2.23} $$



For the theory of this section, we want to express the complete-data
density

$$ h(x|p) = \Big( n! \Big/ \prod_{i=1}^{k+1} x_i! \Big) \prod_{i=1}^{k+1} p_i^{x_i} \tag{2.24} $$


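in terms of exponential-family parameters. Therefore, for $1 \le i \le k$, define

$$ \phi_i = \log(p_i / p_{k+1}). \tag{2.25} $$

Definition (2.25) and $\sum_{i=1}^{k+1} p_i = 1$ yield that

$$ p_{k+1} = 1 \Big/ \Big( 1 + \sum_{j=1}^{k} e^{\phi_j} \Big) \tag{2.26} $$

and

$$ p_i = e^{\phi_i} \Big/ \Big( 1 + \sum_{j=1}^{k} e^{\phi_j} \Big). \tag{2.27} $$
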


For $1 \le i \le k$, define the sufficient statistics for p as

$$ t_i(x) = x_i. \tag{2.28} $$

Then h(x|p) can be written in exponential-family form as

$$ h(x|\phi) = b(x) \exp[\phi \, t(x)'] / a(\phi) \tag{2.29} $$

for $b(x) = n! / \prod_{i=1}^{k+1} x_i!$ and $a(\phi) = (1 + \sum_{i=1}^{k} e^{\phi_i})^n$ since $\sum_{i=1}^{k+1} x_i = n$.
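
The change of parameters in (2.25) through (2.27) is easy to check numerically. The sketch below assumes the log-ratio form $\phi_i = \log(p_i/p_{k+1})$, which is the form consistent with (2.27); the function names are hypothetical.

```python
from math import exp, log


def to_phi(p):
    """Map (p_1, ..., p_{k+1}) to (phi_1, ..., phi_k) with phi_i = log(p_i / p_{k+1})."""
    return [log(p_i / p[-1]) for p_i in p[:-1]]


def to_p(phi):
    """Inverse map: p_i = e^phi_i / (1 + sum_j e^phi_j), p_{k+1} = 1 / (1 + sum_j e^phi_j)."""
    denom = 1.0 + sum(exp(f) for f in phi)
    return [exp(f) / denom for f in phi] + [1.0 / denom]


p = [0.2, 0.3, 0.5]
phi = to_phi(p)
print(phi, to_p(phi))   # the round trip recovers p up to rounding
```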

2.3.2 Maximum Likelihood Estimate:

For the multinomial distribution, the likelihood is the density.
Thus, we seek to maximize $h(z|\phi)$, the incomplete-data density (2.8)
rewritten in terms of the exponential parameters $\phi$. From Sundberg (1974),
the first and second partial derivatives of the log likelihood are

$$ \partial \log[h(z|\phi)] / \partial\phi = -E[t(x)|\phi] + E[t(x)|z,\phi] \tag{2.30} $$

and

$$ \partial^2 \log[h(z|\phi)] / (\partial\phi \, \partial\phi') = -cov[t(x)|\phi] + cov[t(x)|z,\phi]. \tag{2.31} $$

At the maximum of the likelihood, the vector (2.30) of first partial deriv-
atives is zero, so that

$$ E[t(x)|\phi] = E[t(x)|z,\phi]. \tag{2.32} $$

Since

$$ E[t_i(x)|\phi] = n p_i \tag{2.33} $$

and, from (2.23),

$$ E(z_i|z,\phi) = z_i \tag{2.34} $$

and

$$ E(z_D^{(i)}|z,\phi) = z_D \, p_i / p_D, \tag{2.35} $$

where, again, $p_D = \sum_{i \in D} p_i$, evaluation of (2.32) yields that the maximum
likelihood estimate $\tilde{p}_i$ of $p_i$ is

$$ \tilde{p}_i = \Big[ z_i + \sum_{D \ni i} z_D \, \tilde{p}_i / \tilde{p}_D \Big] \Big/ n. \tag{2.36} $$

To solve the nonlinear system of equations arising from (2.36), we
use the EM algorithm of Dempster, Laird, and Rubin (1977). The algorithm
is divided into two steps. In the expectation step (their E-step), the
complete-data sufficient statistics t(x) are estimated by finding a solu-
tion to

$$ [t(x)]^{(l)} = E[t(x)|z,\phi^{(l)}]. \tag{2.37} $$

In the maximization step (their M-step), $\phi^{(l+1)}$ is determined as the solu-
tion of the equations

$$ E[t(x)|\phi] = [t(x)]^{(l)}. \tag{2.38} $$

Thus, translating back from $\phi$ to p, we estimate an initial value $p_i^{(0)}$
of $p_i$ for $1 \le i \le k$. We then substitute $p^{(0)}$, together with z, into the
right-hand side $z_i + \sum_{D \ni i} z_D \, p_i^{(0)} / p_D^{(0)}$ of (2.37) and evaluate for
$[t(x)]^{(0)}$. Given $[t(x)]^{(0)}$, we then solve (2.38); i.e., we solve

$$ n p_i = [t_i(x)]^{(0)} \tag{2.39} $$

for $p_i^{(1)}$; hence, $p_i^{(1)} = [t_i(x)]^{(0)}/n$. We then successively repeat the E
and M steps until convergence; that is, until successive values of $p^{(l)}$
agree to the desired number of significant figures.
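
The E and M steps above reduce to repeatedly applying the right-hand side of (2.36). A minimal sketch follows, assuming a uniform starting value and a simple stopping rule on successive iterates; the function name, the dictionary encoding of the sets D (0-based category indices), and the tolerance are illustrative choices, not part of the original derivation.

```python
def em_mle(z, partial, tol=1e-12, max_iter=10_000):
    """Iterate (2.36).

    z       -- completely classified counts z_i, one per category
    partial -- dict mapping a tuple of 0-based category indices D to the count z_D
    """
    k1 = len(z)
    n = sum(z) + sum(partial.values())
    p = [1.0 / k1] * k1                     # initial value p^(0)
    for _ in range(max_iter):
        # E-step: expected complete-data counts, z_i + sum_D z_D * p_i / p_D
        t = [float(z_i) for z_i in z]
        for D, z_d in partial.items():
            p_d = sum(p[i] for i in D)
            for i in D:
                t[i] += z_d * p[i] / p_d
        # M-step: solve n * p_i = t_i
        new_p = [t_i / n for t_i in t]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Data of Section 2.2.3: z = (2, 5, 6), z_{1,2} = 4, z_{1,3} = 2.
print(em_mle([2, 5, 6], {(0, 1): 4, (0, 2): 2}))
```

With no partially classified observations the loop stops after the first repeat and returns the usual proportions $z_i/n$.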

Since we are concerned only with finite values of z, the likelihood
$h(z|\phi)$ is bounded. Hence, the first condition of Dempster, Laird, and
Rubin (1977) for guaranteeing convergence of the EM algorithm to a local
maximum or saddle point is satisfied. Further, the complete-data multi-
nomial distribution is a member of the regular exponential family. Hence,
the last convergence condition is simply that the eigenvalues of
$cov[t(x)|\phi]$ be bounded above zero on some path joining all $\phi^{(l)}$. From
Graybill (1969, p187), the eigenvalues $\lambda$ are the solution to the
characteristic equation

$$ \Big[ 1 - \sum_{i=1}^{k} p_i^2 / (p_i - \lambda) \Big] \prod_{i=1}^{k} (p_i - \lambda) = 0. \tag{2.40} $$

In general, we want (2.31) to be negative semidefinite.

Dempster, Laird, and Rubin (1977) give the rate of convergence of the
EM algorithm. For the multinomial distribution the rate of convergence is
the largest eigenvalue of

$$ cov[t(x)|z,\phi^{(t)}] \, \{cov[t(x)|\phi^{(t)}]\}^{-1} \tag{2.41} $$

for $\phi^{(t)}$ the converged estimate of $\phi^{(l)}$, provided that this eigenvalue is
less than 1. As expected, when the percentage of incomplete data is small, 
the algorithm converges rapidly. As the percentage of incomplete data 
increases, the number of iterations increases. Dempster, Laird, and Rubin 
also note that, since the allocation of incompletely specified observations 
often varies across different components of p, certain components of p may 
converge rapidly while others may converge slowly. 

2.3.3 Posterior Mode:

The derivation for the posterior mode of p given z is similar to that 
for the maximum likelihood estimate. For the posterior mode, however, the 
prior must be included in the maximization. 

Recall from (2.9) that the posterior density of p given z is

$$ f(p|z) = g(p) \, h(z|p) \Big/ \int_{P_k} g(p) \, h(z|p) \, dp. \tag{2.42} $$

From definition (1.1) of the prior g(p), that piece of log[f(p|z)] from
(2.42) that depends on p is the same as that piece of log[h(z|p)] that
depends on p except that, for $1 \le i \le k+1$, $z_i$ is replaced by $(z_i+\nu_i-1)$
and, hence, n is replaced by $n + \sum_{j=1}^{k+1} \nu_j - (k+1)$. Therefore, from
(2.36), the posterior mode $\hat{p}$ of p given z is given by

$$ \hat{p}_i = \Big( z_i + \nu_i - 1 + \sum_{D \ni i} z_D \, \hat{p}_i / \hat{p}_D \Big) \Big/ \Big[ n + \sum_{j=1}^{k+1} \nu_j - (k+1) \Big] \tag{2.43} $$


for 1-i-k+l. As for the maximum likelihood estimate, we evaluate the non- 
linear system of equations arising from (2.43) by the EM algorithm. The 
comments in Section 2.3.2 concerning convergence also hold for the poster- 
ior mode. In general, the prior should reduce the effect of incomplete 
data so that convergence should be somewhat faster for the posterior mode 
than for the maximum likelihood estimate. The numerator for the conver- 
gence matrix in (2.41) is given in Appendix 4D.2 for the maximum likeli- 
hood estimate. Derivation for the posterior mode is similar. Calculating 
second partial derivatives of the two log likelihoods for the complete-data
case yields for elements of $\{cov[t(x)|\phi^{(t)}]\}^{-1}$ in the denominator of
(2.41):

for the maximum likelihood estimate - 


d 11 ' = n[-l/p, (t) +l/p k+1 (t) ] 


and 


/S 1 "J A 

0 = "Pk+l 


1 

(t) 


(2.44) 


and for the posterior mode - 

k+l 


5 11 = tn+ E v — (k+l)] {-[p j (t) +v 1 -l]/(P 1 (t) ] 2 +[P k+1 '"'+v |<+1 -l]/[P k+1 "''n 


(t) 




and 


j=i 

j+i 


s ,J = m+Vv -(k+i)i [p k+1 (t) *vr 1 ]/[(W w ] <: . 


( t ) -, 2 


= i J 


j = l 


(2.45) 
In most cases, the prior parameters $\nu_i$ are greater than 1; hence, the
denominator in (2.41) is usually larger for the posterior mode than for
the maximum likelihood estimate. Which of the posterior mode and the
maximum likelihood estimate actually has the faster rate of convergence,
of course, depends also on the relative sizes of the numerators in (2.41).

Note from the "-1" term that it is possible for (2.43) to be nega-
tive. If so, the mode occurs at a boundary point; i.e., the posterior
mode is zero. Also observe that if $\nu_i=1$, for $1 \le i \le k+1$, then the
posterior mode and the maximum likelihood estimate are identical.
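
The posterior-mode iteration differs from the maximum likelihood iteration only in the added $\nu_i - 1$ and the enlarged divisor. The sketch below illustrates (2.43) under the assumption that every $\nu_i \ge 1$, so the boundary case noted above does not arise and is not handled; the function name and the data encoding (0-based category indices for each set D) are hypothetical.

```python
def posterior_mode(z, partial, nu, tol=1e-12, max_iter=10_000):
    """Iterate (2.43); assumes every nu_i >= 1 so no component is driven below zero."""
    k1 = len(z)
    n = sum(z) + sum(partial.values())
    denom = n + sum(nu) - k1                # n + sum_j nu_j - (k + 1)
    p = [1.0 / k1] * k1
    for _ in range(max_iter):
        # z_i + nu_i - 1 plus the expected share of each partially classified count
        t = [z_i + nu_i - 1.0 for z_i, nu_i in zip(z, nu)]
        for D, z_d in partial.items():
            p_d = sum(p[i] for i in D)
            for i in D:
                t[i] += z_d * p[i] / p_d
        new_p = [t_i / denom for t_i in t]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


data = ([2, 5, 6], {(0, 1): 4, (0, 2): 2})
print(posterior_mode(*data, nu=[1, 1, 1]))   # uniform prior: the maximum likelihood estimate
print(posterior_mode(*data, nu=[3, 3, 3]))   # a prior pulling toward the center of the simplex
```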



CHAPTER 3 


APPROXIMATIONS FOR POSTERIOR MEAN AND COVARIANCE MATRICES 
3.1 Introduction:

As discussed in Section 2.2.4, numerical evaluation of elements of 
the mean and covariance matrices of the posterior distribution of p 
given incomplete data z is unfeasible for all but those cases having 
only a small number of incompletely specified observations. Therefore, 
we seek approximations for these posterior moments. 

In the next chapter, we prove that the limiting central moments of 
p given z are corresponding moments of the limiting distribution. In 
particular, the limit of the posterior mean is the mean of the limiting 
posterior distribution. We also prove that the mean of the limiting 
posterior distribution is the maximum likelihood estimate (2.36). 
Finally, from equations (2.36) and (2.43), the posterior mode equals the 
maximum likelihood estimate in the limit and, hence, equals the limiting 
posterior mean. Therefore, two natural candidates to approximate the 
exact posterior mean are the maximum likelihood estimate and the pos- 
terior mode. However, there are also problems in using these estimates 
as approximations. 

The maximum likelihood estimate is best known for being good in
large samples; it is not necessarily good in small samples. In
particular, if a value of $z_i$ has been observed that has very small
probability for given $p_i$, then the maximum likelihood estimate will be
poor if the sample size is small. For example, if $p_i = .20$ and we
observe $z_i = 10$ in a sample of size 25, then the maximum likelihood
estimate $\tilde{p}_i = .40$ is a poor estimate of $p_i$. Further, the maximum
likelihood estimate is the correct estimate for an estimation criterion
of choosing that value of p that maximizes the likelihood (2.8) and not
for an estimation criterion of minimizing expected risk (1.5). Finally,
the maximum likelihood estimate has no place for a prior, which is
important in all but those cases in which the current data is of large
enough sample size, or significantly greater relevance, to drown out
past information.

The posterior mode (2.43) does incorporate the prior. However, the
posterior mode is the correct estimate for an estimation criterion of
choosing that value of p that maximizes the posterior density given the
prior density g(p) and observed data z and not for an estimation cri-
terion of minimizing expected risk. Finally, from equation (2.43) we
observe that, for small enough prior $\nu_i$, a component of the posterior
mode $\hat{p}_i$ can be approximately zero even though an observation ($z_i=1$) has
been observed.

A different approach for approximating the exact posterior mean $\bar{p}$
is to note that the posterior mean of the complete-data Dirichlet
density with prior parameters $(\nu_1,\ldots,\nu_k,\nu_{k+1})$ equals the posterior mode
of the complete-data Dirichlet density with prior parameters
$(\nu_1+1,\ldots,\nu_k+1,\nu_{k+1}+1)$; that is, from (2.2),

$$ (x_i+\nu_i) \Big/ \Big( n + \sum_{j=1}^{k+1} \nu_j \Big) = [x_i + (\nu_i+1) - 1] \Big/ \Big[ n + \sum_{j=1}^{k+1} (\nu_j+1) - (k+1) \Big]. \tag{3.1} $$

Therefore, paralleling the incomplete-data posterior mode (2.43), we
could estimate the incomplete-data exact posterior mean (2.13) by

$$ \dot{p}_i = \Big( z_i + \nu_i + \sum_{D \ni i} z_D \, \dot{p}_i / \dot{p}_D \Big) \Big/ \Big( n + \sum_{j=1}^{k+1} \nu_j \Big). \tag{3.2} $$

A very important property of approximation (3.2) is that as the propor-
tion $\sum_{D \ni i} z_D / (n + \sum_j \nu_j)$ of incomplete data goes to zero, approximation
(3.2) equals the exact posterior mean (3.1).

However, there are problems with this approach to obtain (3.2). We 
find in this chapter that the relationship between the posterior mean 
and posterior mode for complete data does not hold for incomplete data. 
Thus, (3.2) is an approximation and this approach does not enable us to 
assess its accuracy. Finally, from consideration of the definition and 
from small-sample examples (one given at the end of this chapter), we 
do not expect the large-sample covariance matrix of the posterior mode 
or maximum likelihood estimate to be a good approximation for the exact 
posterior covariance matrix. Therefore, we seek another type of approach 
for estimating the exact posterior central moments. 

As noted, both the posterior mode and maximum likelihood estimate 
are derived from consideration of an estimation criterion other than 
minimization of expected risk (1.5). Therefore, one way to seek another 
approximation is to start with the desired estimation criterion; that is,
begin with the exact solutions for the posterior mean and covariance 
matrices. However, approximating exact solutions (2.13) - (2.17) for 
the posterior moments given incomplete data is difficult because of the 
number and structure of terms. An alternative method starts with exact 
solutions (2.2) - (2.6) for the posterior moments given complete data 
and then transforms these solutions via conditional probability to the 
incomplete-data case, making any necessary approximations along the
way. 

In this chapter, we follow the above approach to derive approxima- 
tions for the posterior central moments by making extended use of 
conditional probability and first-order Taylor-series approximations. 
Section 2.2 gives the posterior moments given complete data. Therefore, 
for incomplete data z we substitute fictitious complete data consistent 
with z and write the results of Section 2.2.1. Then, twice applying 
known lemmas on conditioning, we average results from the complete-data 
step over the posterior distribution of the unknown, substituted, com- 
plete data. At this point we still have unknown terms in the 
expressions. For these, we use Taylor-series approximations. The 
resulting approximation for the posterior mean is equation (3.2); hence, 
as the percentage of incomplete data goes to zero, the approximation 
goes to the exact posterior mean. From (2.36) and (2.43), neither the 
maximum likelihood estimate nor the posterior mode has this important 
property. Also, since asymptotically (3.2) equals the maximum likelihood 
estimate (2.36), it equals the limiting exact posterior mean. Further, 
since Taylor-series expansions are used, we can assess the accuracy of 
the approximations. Finally, we can use the same approach to approxi- 
mate elements of the posterior covariance matrix. Doing so, we find the 
same important property in the resulting approximations that they go to 
the exact posterior variances and covariances as the percentage of 
incomplete data goes to zero. Note that, since the Taylor-series 
approximation (3.2) for the posterior mean is also a posterior mode 
[for the prior $\nu_i+1$], it can be evaluated by the EM algorithm discussed
in Section 2.3.2. 

In the next section, we derive the Taylor-series approximations for 
elements of the posterior mean and covariance matrices. Intermediate 
calculations are given in Appendices 3A, 3B, and 3C. Section 3.3 alge- 
braically illustrates the resulting approximations for the trinomial 
distribution. Section 3.4 concludes the chapter with a comparison of 
the Taylor-series approximations, maximum likelihood estimate, and the 
posterior mode on the small-sample data set given in Section 2.2.3.

3.2 Derivation of Taylor-Series Approximations:
3.2.1 Posterior Mean Vector:


Again let D denote those sets $\mathcal{D}$ from Section 1.2 containing more than
one element and, for $i \in D$, define $z_D^{(i)}$ as the number of the $z_D$ observations
that fall in category i. If $z_D^{(i)}$ were known for all i and all D, then the
data would be complete and the posterior central moments would directly
follow from Section 2.2.1. Therefore, assume that we know all $z_D^{(i)}$ and
denote the vector of this unknown information by u. Thus, u is the vector
of all $z_D^{(i)}$ for all D and all $1 \le i \le k$. For example, in Section 2.2.3 we
would have that $u = (z_{12}^{(1)}, z_{13}^{(1)})$ and $z = (z_1, z_2, z_3, z_{12}, z_{13}, z_{23})$. Given
z and u, then, for $1 \le i \le k$, we have complete data

$$ x_i = \sum_{\mathcal{D} \ni i} z_{\mathcal{D}}^{(i)} = z_i + \sum_{D \ni i} z_D^{(i)}. \tag{3.3} $$


Thus, from Section 2.2.1, recalling that $m = \sum_{j=1}^{k+1} \nu_j + n$, we have from
(2.2) the posterior mean

$$ E(p_i|z,u) = (x_i+\nu_i)/m = \Big( z_i + \sum_{D \ni i} z_D^{(i)} + \nu_i \Big) \Big/ m. \tag{3.4} $$

To obtain moments of p given only the observed data z, then, we average
result (3.4) over the distribution of u|z. To do so, write the posterior
density f(p|z) as

$$ f(p|z) = \int \ell(p,u|z) \, du = \int g(p|z,u) \, h(u|z) \, du \tag{3.5} $$

for $\ell(p,u|z)$ the joint posterior density of p and u given z and g(p|z,u)
and h(u|z), conditional densities.
From (3.5) we obtain the following standard lemma [see Parzen (1962,
p55) or Rao (1968, p79)] on conditioning, which we write in terms of gen-
eral random variables V and W because we apply the lemma to one other den-
sity besides f(p|z):

Lemma 3.1: For random variables V and W, and where the variable under the
expected-value sign E is the variable with respect to which the expecta-
tion is to be taken: $E(V) = E_W[E(V|W)]$, $var(V) = E_W[var(V|W)] + var_W[E(V|W)]$,
and $cov(V_1,V_2) = E_W[cov(V_1,V_2|W)] + cov_W[E(V_1|W), E(V_2|W)]$.

By using Lemma 3.1 and (3.4) we have, defining $r_{iD} = p_i/p_D$, that

$$ \begin{aligned}
E(p_i|z) &= E_{u|z}[E(p_i|z,u)] \\
&= E_{u|z}\Big\{ \Big[ \sum_{\mathcal{D} \ni i} z_{\mathcal{D}}^{(i)} + \nu_i \Big] \Big/ m \Big\} \\
&= \Big[ z_i + \sum_{D \ni i} E(z_D^{(i)}|z) + \nu_i \Big] \Big/ m \\
&= \Big\{ z_i + \nu_i + \sum_{D \ni i} E_{p|z}\big[ E_u(z_D^{(i)}|z,p) \big] \Big\} \Big/ m \\
&= \Big[ z_i + \nu_i + \sum_{D \ni i} z_D \, E_{p|z}(r_{iD}|z) \Big] \Big/ m.
\end{aligned} \tag{3.6} $$

The first line of (3.6) follows from applying Lemma 3.1 to $E(p_i|z)$; the
second line, from complete-data posterior mean (2.2); and the third line,
from separating out that part of $\sum_{\mathcal{D} \ni i} z_{\mathcal{D}}^{(i)}$ that is already known. The
fourth line of (3.6) follows from applying Lemma 3.1 to $E(z_D^{(i)}|z)$; and
the last line, from the complete-data multinomial specification.
In Appendix 3B we show through Taylor-series expansions that

$$ E_{p|z}(r_{iD}|z) = E(p_i|z) / E(p_D|z) + O(n^{-1}), \tag{3.7} $$

where the symbol O giving the order of magnitude of the error is defined
in Appendix 3B, where details are also given. Therefore, substituting
(3.7) into (3.6) and collecting terms yields, for $\bar{p}_i = E(p_i|z)$ and
the error $e_i$ to be determined in Chapter 4, that

$$ \bar{p}_i = \Big( z_i + \nu_i + \sum_{D \ni i} z_D \, \bar{p}_i / \bar{p}_D \Big) \Big/ m + e_i. \tag{3.8} $$

Dropping the error term in (3.8) yields, for $1 \le i \le k$, the Taylor-series
approximate posterior mean vector $\dot{p}$; i.e.,

$$ \dot{p}_i = \Big( z_i + \nu_i + \sum_{D \ni i} z_D \, \dot{p}_i / \dot{p}_D \Big) \Big/ m. \tag{3.9} $$

Observe that (3.9) is the same approximation (3.2) obtained by
paralleling the complete-data relationship between the exact posterior
mean and the posterior mode.

Calculations for Taylor-Series Approximate Posterior Mean: For
those categories i that have only complete data, the Taylor-series
approximation is the exact posterior mean (2.2). For those categories i
that have incomplete data, we use the EM iterative algorithm of Dempster,
Laird, and Rubin (1977) described in Section 2.3.2 since (3.9) is a
posterior mode for the prior $\beta_i = \nu_i+1$. Thus, for those categories i
that have incomplete data, s denoting the iteration number, and

$$ \dot{r}_{iD}^{(s)} = \dot{p}_i^{(s)} / \dot{p}_D^{(s)}, \tag{3.10} $$

we approximate the exact posterior mean from (3.9) by the iterative
algorithm

$$ \dot{p}_i^{(s+1)} = \Big[ z_i + \nu_i + \sum_{D \ni i} z_D \, \dot{r}_{iD}^{(s)} \Big] \Big/ m. \tag{3.11} $$

To begin (3.11), we use the data z, prior parameters $\nu_i$, and any
other available information to choose an initial estimate $\dot{p}_i^{(0)}$, $1 \le i \le k$,
and thus an initial estimate of $\dot{r}_{iD}^{(0)}$. Substituting $\dot{r}_{iD}^{(0)}$ into the
right-hand side of (3.11), we evaluate (3.11) to obtain $\dot{p}_i^{(1)}$ and $\dot{r}_{iD}^{(1)}$
for all i referring to categories having incomplete data. Using $\dot{r}_{iD}^{(1)}$,
we then reevaluate (3.11) to calculate $\dot{p}_i^{(2)}$. We continue in this
cyclic fashion until results from successive iterations agree to the
desired number of significant figures.
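
The iteration (3.11) can be sketched as follows. This is an illustration under simple assumptions (uniform starting value, stopping when successive iterates agree to a tolerance), with a hypothetical function name and data encoding (0-based category indices for each set D); it is applied to the data of Section 2.2.3 so that the result can be set beside the exact posterior means of Table 2.1.

```python
def taylor_posterior_mean(z, partial, nu, tol=1e-12, max_iter=10_000):
    """Iterate (3.11): p_i <- [z_i + nu_i + sum_{D containing i} z_D * p_i / p_D] / m."""
    k1 = len(z)
    m = sum(z) + sum(partial.values()) + sum(nu)   # m = n + sum of prior parameters
    p = [1.0 / k1] * k1                            # initial estimate
    for _ in range(max_iter):
        t = [float(z_i + nu_i) for z_i, nu_i in zip(z, nu)]
        for D, z_d in partial.items():
            p_d = sum(p[i] for i in D)
            for i in D:
                t[i] += z_d * p[i] / p_d           # z_D * r_iD at the current iterate
        new_p = [t_i / m for t_i in t]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


approx = taylor_posterior_mean([2, 5, 6], {(0, 1): 4, (0, 2): 2}, [1, 1, 1])
exact = (0.241202, 0.384927, 0.373871)             # E(p_i | z) from Table 2.1
for a, e in zip(approx, exact):
    print(round(a, 6), e)
```

When no observations are partially classified, the function returns the complete-data posterior mean $(z_i+\nu_i)/m$ after a single pass, in line with the property noted for (3.2).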

Note that the system of k equations arising from (3.9) for the
Taylor-series approximate posterior mean is nonlinear. Thus, as for
the maximum likelihood estimate (2.36) and the posterior mode (2.43),
the number of solutions to this system can range from zero to infinity.
[See Ortega and Rheinboldt (1970, p2).] If there are solutions, none
need be in $P_k$. If a solution is in $P_k$, it need not be close to the
exact posterior mean. However, since (3.9) is a posterior mode for the
prior $\beta = \nu+1$, Dempster, Laird, and Rubin (1977) give conditions (dis-
cussed in Sections 2.3, 4.3.2, and 5.8.3 and Appendix 4E) under which
an iterative solution for (3.9) converges to a local maximum in $P_k$.
Hence, when these conditions are met, there is at least one solution in
$P_k$. In Chapter 4 we give conditions under which an iterative solution
converges to within a small error of the exact posterior mean. We also
speculate that this solution, when it exists, is given by the global
maximum in $P_k$, which is found by choosing that one of the local
maxima in $P_k$ that maximizes the likelihood.

When there are only a few patterns of incomplete data, the non-
linear system of equations arising from (3.9) for the posterior mean
vector can sometimes be solved analytically. Several solutions will be
obtained but usually all but one will fail to satisfy the constraints
$0 \le p_i \le 1$ and $\sum_{i=1}^{k+1} p_i = 1$. Examples of analytic solutions for the
asymptotic posterior mean and covariance matrices are given in Appendix 4D.5.

3.2.2 Posterior Covari ance_Mat ri x : 

For approximating elements of the posterior covariance matrix, we 
follow the same procedure given in the last section. For the complete- 
data step that lead to (3.4), we obtain 

var(p i |z,u) = (E(p i |z,u) [1-E(p i |z,u)]}/(m+l) (3.12) 

and 

cov(p i ,p h |z,u) = - [E(p i |z,u) E(p h |z,u)]/(m+l). (3.13) 

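For the conditioning step that led to (3.6), we obtain

\[
\begin{aligned}
\operatorname{var}(p_i\mid z) ={}& \sum_{D\ni i}\Big[z_D^2\operatorname{var}_{p|z}(r_{iD}\mid z)
   + \sum_{Q\ni i,\,Q\ne D} z_D z_Q\operatorname{cov}_{p|z}(r_{iD},r_{iQ}\mid z)\Big]\Big/[m(m+1)] \\
 &+ \Big\{\sum_{D\ni i}(z_D/m)\,\mathrm{E}_{p|z}[r_{iD}(1-r_{iD})\mid z]
   + \mathrm{E}(p_i\mid z)[1-\mathrm{E}(p_i\mid z)]\Big\}\Big/(m+1),
\end{aligned}
\tag{3.14}
\]

and, for $h\ne i$,

\[
\begin{aligned}
\operatorname{cov}(p_i,p_h\mid z) ={}& \sum_{D\ni i}\sum_{Q\ni h} z_D z_Q/[m(m+1)]\,\operatorname{cov}_{p|z}(r_{iD},r_{hQ}\mid z) \\
 &- \Big[\sum_{D\ni i,h}(z_D/m)\,\mathrm{E}_{p|z}(r_{iD}r_{hD}\mid z)
   + \mathrm{E}(p_i\mid z)\,\mathrm{E}(p_h\mid z)\Big]\Big/(m+1).
\end{aligned}
\tag{3.15}
\]

Derivations for (3.14) and (3.15) are given in Appendix 3A.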

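Finally, for the ratio-approximation step that led to (3.8), we
have, with ratio moments given in Appendix 3B and substitution details
for (3.14) given in Appendix 3C, that

\[
\begin{aligned}
\sigma_{ii} ={}& \sum_{D\ni i}(z_D/m)[(z_D-1)/(m+1)]\,\bar p_D^{-4}
   \Big\{\bar p_{D\setminus i}^{\,2}\sigma_{ii}
   + \bar p_i\sum_{j\in D\setminus i}\Big[\bar p_i\sigma_{jj} - 2\bar p_{D\setminus i}\sigma_{ij}
   + 2\bar p_i\sum_{l\in D\setminus i,\ l>j}\sigma_{jl}\Big]\Big\} \\
 &+ \sum_{D\ni i}\sum_{Q\ni i,\,Q\ne D}(z_D/m)[z_Q/(m+1)]\,(\bar p_D\bar p_Q)^{-2}
   \Big\{\bar p_{D\setminus i}\Big[\bar p_{Q\setminus i}\sigma_{ii} - \bar p_i\sum_{l\in Q\setminus i}\sigma_{il}\Big]
   + \bar p_i\sum_{j\in D\setminus i}\Big[\bar p_i\sum_{l\in Q\setminus i}\sigma_{jl} - \bar p_{Q\setminus i}\sigma_{ij}\Big]\Big\} \\
 &+ \Big[\sum_{D\ni i}(z_D/m)\,\bar p_i\bar p_{D\setminus i}/\bar p_D^{\,2} + \bar p_i(1-\bar p_i)\Big]\Big/(m+1) + \delta_{ii}
\end{aligned}
\tag{3.16}
\]

and, for $h>i$ and $D\setminus h$ denoting D minus the integer h,

\[
\begin{aligned}
\sigma_{ih} ={}& \sum_{D\ni i}\sum_{Q\ni h}(z_D/m)[z_Q/(m+1)]\,(\bar p_D\bar p_Q)^{-2}
   \Big\{\bar p_{D\setminus i}\Big[\bar p_{Q\setminus h}\sigma_{ih} - \bar p_h\sum_{l\in Q\setminus h}\sigma_{il}\Big]
   + \bar p_i\sum_{j\in D\setminus i}\Big[\bar p_h\sum_{l\in Q\setminus h}\sigma_{jl} - \bar p_{Q\setminus h}\sigma_{jh}\Big]\Big\} \\
 &- \Big[\sum_{D\ni i,h}(z_D/m)\,\bar p_D^{-4}\Big(\bar p_i\bar p_h\bar p_D^{\,2}
   + \bar p_{D\setminus i}\bar p_{D\setminus h}\sigma_{ih}
   - \bar p_h\bar p_{D\setminus i}\sum_{l\in D\setminus h}\sigma_{il}
   - \bar p_i\bar p_{D\setminus h}\sum_{j\in D\setminus i}\sigma_{jh}
   + \bar p_h\bar p_i\sum_{j\in D\setminus i}\sum_{l\in D\setminus h}\sigma_{jl}\Big)
   + \bar p_i\bar p_h\Big]\Big/(m+1) + \delta_{ih},
\end{aligned}
\tag{3.17}
\]

where $\sigma_{ii}=\operatorname{var}(p_i\mid z)$, $\sigma_{ih}=\operatorname{cov}(p_i,p_h\mid z)$,
$D\setminus i$ and $Q\setminus i$ (or $Q\setminus h$) denote D and Q, respectively,
minus the indicated integer (so that $D\setminus i$ is D minus i, and Q is
reduced by h or by i depending on the definition of Q given under the
summation sign), and, again, $\bar p_i=\mathrm{E}(p_i\mid z)$ so that
$\bar p_{ij}=\bar p_i+\bar p_j$ and, in general, $\bar p_D=\sum_{j\in D}\bar p_j$.
The terms $\delta_{ii}$ and $\delta_{ih}$ represent the error made by
approximating posterior moments of the ratios $r_{iD}$ in equations (3.14)
and (3.15), respectively.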

Dropping the error terms in (3.16) and (3.17) and then solving the
resulting nonlinear system of equations for $1\le i\le k$ and $i<h\le k$
yields the Taylor-series approximate posterior covariance matrix with
elements $\tilde\sigma_{ii}$ and $\tilde\sigma_{ih}$. Note that, as for the
Taylor-series approximation for the posterior mean, the Taylor-series
approximate posterior covariance matrix goes to the exact posterior
covariance matrix as the percentage of incomplete data goes to zero.

Calculations for Taylor-Series Approximate Covariances: Thus, to
solve the nonlinear system of equations for the Taylor-series
approximate posterior covariance matrix, first note that for those
categories that have only complete data,

\[
\tilde\sigma_{ii} = \tilde p_i(1-\tilde p_i)/(m+1) \tag{3.18}
\]

and, for category h also having only complete data,

\[
\tilde\sigma_{ih} = -\tilde p_i\tilde p_h/(m+1), \tag{3.19}
\]

in agreement with (2.3) and (2.4), respectively. Recall that
$\tilde p_i=\bar p_i$ and $\tilde p_h=\bar p_h$ in this case of complete data.

For those categories i that have incomplete data, results are a
noniterative estimate of $\tilde\sigma_{ih}$ for category h having only
complete data and a choice of iterative and noniterative estimates for
elements $\tilde\sigma_{ij}$ for category j, as well as i, having
incomplete data.

For category h having only complete data and category i having
incomplete data, we approximate $\operatorname{cov}(p_i,p_h\mid z)$ by

\[
\tilde\sigma_{ih} = -\tilde p_i^{(t)}\tilde p_h/(m+1) \tag{3.20}
\]

for $\tilde p_i^{(t)}$ denoting the converged estimate from (3.11).
Approximation (3.20) is noniterative in $\tilde\sigma_{ih}$.

For i and h referring to categories that have incomplete data and
for s again denoting the number of iterations, we can write (3.16) and
(3.17) as iterative algorithms. To do so, we drop the error terms
$\delta_{ih}$ and write $\sigma_{ih}$ on the left-hand side of (3.16) and
(3.17) as $\tilde\sigma_{ih}^{(s+1)}$, and $\bar p_i$ and $\sigma_{ih}$ on
the right-hand side of these equations as $\tilde p_i^{(t)}$ and
$\tilde\sigma_{ih}^{(s)}$, respectively, for $\tilde p_i^{(t)}$ denoting the
converged estimate from (3.11). These equations are given for the
trinomial distribution in the next section.

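To obtain initial estimates $\tilde\sigma_{ii}^{(0)}$ and
$\tilde\sigma_{ih}^{(0)}$, we assume, for the first iteration only, that
the ratio $p_i/p_D$ is nonrandom. With this assumption, the variance and
covariance terms of the ratios in (3.14) and (3.15) vanish, and we have
from (3.11), (3.14), and (3.15) that

\[
\tilde\sigma_{ii}^{(0)} = \Big[\tilde p_i^{(t)}(1-\tilde p_i^{(t)})
   + \sum_{D\ni i}(z_D/m)\,\tilde r_{iD}^{(t)}(1-\tilde r_{iD}^{(t)})\Big]\Big/(m+1) \tag{3.21}
\]

and

\[
\tilde\sigma_{ih}^{(0)} = -\Big[\tilde p_i^{(t)}\tilde p_h^{(t)}
   + \sum_{D\ni i,h}(z_D/m)\,\tilde r_{iD}^{(t)}\tilde r_{hD}^{(t)}\Big]\Big/(m+1). \tag{3.22}
\]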

The second procedure for estimating elements of the posterior
covariance matrix for those q categories that have incomplete data is
noniterative in $\tilde\sigma_{ij}$. For both i and j referring to
categories having incomplete data, $a_{lh}$ coefficients of $\sigma_{lh}$,
and $b_{ij}$ a term that is not a function of $\sigma_{lh}$ for any l or h,
we can write equations (3.16) and (3.17) as

\[
\sigma_{ij} = \sum_{D\ni i}\sum_{Q\ni j}\Big[\sum_{l\in D}\sum_{h\in Q} a_{lh}\,\sigma_{lh}\Big] + b_{ij} + \delta_{ij}, \tag{3.23}
\]

where we note that $\delta_{ij}$ also contains terms in $\sigma_{ij}$ [for
example, second-order terms in the approximation for
$\mathrm{E}(r_{iD}\mid z)$ are terms in $\sigma_{ij}$].

Thus, we can write (3.23) as a linear system of q(q+1)/2 equations
in the q(q+1)/2 unknowns $\sigma_{ii}$ and $\sigma_{ij}$:

\[
[A + \delta_A]\,\sigma = B\,[I + \delta_B]\,\mathbf{1}, \tag{3.24}
\]

where $\sigma$ is the q(q+1)/2 x 1 vector of $\sigma_{ij}$ for both i and j
referring to categories having incomplete data, A is the
q(q+1)/2 x q(q+1)/2 matrix of the $a_{lh}$, B is the q(q+1)/2 x q(q+1)/2
matrix with $b_{ij}$ on the diagonal and 0's elsewhere, I is the
q(q+1)/2 x q(q+1)/2 identity matrix, $\delta_A$ is the q(q+1)/2 x q(q+1)/2
matrix containing those terms in $\delta_{ij}$ that are terms in $\sigma$,
$\delta_B$ is the q(q+1)/2 x q(q+1)/2 matrix containing zeros on the
off-diagonal and the remaining terms of $\delta_{ij}$ divided by $b_{ij}$
on the diagonal, and $\mathbf{1}$ is the q(q+1)/2 x 1 vector containing
all 1's.

The Taylor-series approximation $\tilde\sigma_{ij}$ for these terms
$\sigma_{ij}$ of the covariance matrix is then given from (3.24) by
dropping the error terms $\delta_A$ and $\delta_B$; substituting the
converged approximation $\tilde p_i^{(t)}$ from (3.11) for $\bar p_i$ in A
and B, yielding the matrices $\tilde A$ and $\tilde B$, respectively; and
computing $\tilde\sigma$ as

\[
\tilde\sigma = \tilde A^{-1}\tilde B\,\mathbf{1}. \tag{3.25}
\]
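
The noniterative step (3.25) is a single linear solve. The sketch below shows one way to carry it out in plain Python with Gaussian elimination; it is an illustration only, and the 3 x 3 matrix and right-hand side are hypothetical stand-ins for $\tilde A$ and $\tilde B\mathbf{1}$ (the case q = 2), not values from the text.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-14:
            raise ValueError("coefficient matrix is numerically singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


# Hypothetical coefficient matrix and constant vector (B-tilde times the
# vector of ones is simply the vector of the b_ij).
A_tilde = [[-0.98, 0.01, 0.00],
           [0.01, -0.97, 0.01],
           [0.00, 0.01, -0.98]]
b_tilde = [-0.0040, 0.0020, -0.0045]
sigma_tilde = solve_linear(A_tilde, b_tilde)
```

Solving the system directly avoids forming the inverse explicitly, which is usually the cheaper and more stable choice for a one-time solve.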


The tradeoff between the two procedures to approximate elements of
the posterior covariance matrix for those categories that have
incomplete data is the cost of the one-time expense of the larger-
dimensional operation (3.25) in the noniterative procedure versus the
cost of iteratively evaluating the smaller-dimensional [q(q+1)/2] x 1
covariance vector written directly from (3.16) and (3.17). In the next
section we illustrate these Taylor-series approximations by writing them
for the general case for the trinomial distribution. We conclude the
chapter by giving a numerical example.
3.3 Algebraic Trinomial Illustration:


Suppose that, having taken minimization of expected risk as the
criterion for choosing a point estimate of the posterior distribution of
$p=(p_1,p_2,p_3)$ given incomplete trinomial data
$z=(z_1,z_2,z_3,z_{12},z_{13},z_{23})$, we want to calculate elements
$\bar p_1$, $\bar p_2$, and $\bar p_3=1-\bar p_1-\bar p_2$ of the posterior
mean vector. Suppose also that, for the same estimation criterion, we
have past estimates $p_i'$ calculated from a recent data sample of size
$n'$ and prior parameters $\nu_i'$, whence we calculate new prior
parameters

\[
\nu_i = \Big(n' + \sum_{j=1}^{k+1}\nu_j'\Big)\,p_i'. \tag{3.26}
\]

If we had no information other than z, we could set $\nu_i=1$ to obtain a
uniform prior.

Recall that $\tilde p_{ij}=\tilde p_i+\tilde p_j$ and that
$\tilde r_{ij}=\tilde p_i/\tilde p_{ij}$. Then, from (3.11), iterative
estimates of elements of the posterior mean are given by

\[
\begin{aligned}
\tilde p_1^{(s+1)} &= \big(z_1+\nu_1+z_{12}\tilde r_{12}^{(s)}+z_{13}\tilde r_{13}^{(s)}\big)/m \\
\tilde p_2^{(s+1)} &= \big(z_2+\nu_2+z_{12}\tilde r_{21}^{(s)}+z_{23}\tilde r_{23}^{(s)}\big)/m.
\end{aligned}
\tag{3.27}
\]

To choose an initial estimate $\tilde p_i^{(0)}$ to calculate
$\tilde r_{ij}^{(0)}$ for (3.27), we use the previous estimate $p_i'$,
theoretical results (such as from genetic or engineering laws), and/or
current data. Then, calculating $\tilde r_{ij}^{(0)}$ for
$1\le i,j\le 3$ and substituting results into the right-hand side of
(3.27), we iterate on (3.27) until results converge.
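
The iteration on (3.27) can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that m equals the sample size plus the sum of the prior parameters, it stops when the relative change in every element falls below a tolerance, and the function name and the data in the usage line are hypothetical.

```python
def trinomial_posterior_mean(z, nu, p0, tol=1e-3, max_iter=1000):
    """Iterate (3.27) for the Taylor-series approximate posterior mean.

    z  = (z1, z2, z3, z12, z13, z23): complete and partially classified counts
    nu = (nu1, nu2, nu3): prior parameters
    p0 = starting values for (p1, p2, p3)
    Assumes m = n + sum(nu).
    """
    z1, z2, z3, z12, z13, z23 = z
    m = sum(z) + sum(nu)
    p1, p2 = p0[0], p0[1]
    for _ in range(max_iter):
        p3 = 1.0 - p1 - p2
        r12 = p1 / (p1 + p2)  # r_12 = p_1 / p_12
        r21 = p2 / (p1 + p2)  # r_21 = p_2 / p_12
        r13 = p1 / (p1 + p3)
        r23 = p2 / (p2 + p3)
        new1 = (z1 + nu[0] + z12 * r12 + z13 * r13) / m
        new2 = (z2 + nu[1] + z12 * r21 + z23 * r23) / m
        new3 = 1.0 - new1 - new2
        converged = all(
            abs(new - old) / old < tol
            for new, old in ((new1, p1), (new2, p2), (new3, p3))
        )
        p1, p2 = new1, new2
        if converged:
            break
    return p1, p2, 1.0 - p1 - p2


# Hypothetical data: 9 fully classified and 6 partially classified
# observations, uniform prior, starting values (1/4, 3/8, 3/8).
p_tilde = trinomial_posterior_mean((4, 3, 2, 3, 2, 1), (1, 1, 1),
                                   (0.25, 0.375, 0.375))
```

When all partially classified counts are zero, the loop reproduces the complete-data posterior mean $(z_i+\nu_i)/m$ after one pass.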

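To estimate the posterior covariance matrix, we have from (3.16) and
(3.17) that

\[
\begin{aligned}
\sigma_{11} ={}& (z_{12}/m)[(z_{12}-1)/(m+1)]\{\bar p_2^{\,2}\sigma_{11}+\bar p_1[\bar p_1\sigma_{22}-2\bar p_2\sigma_{12}]\}/\bar p_{12}^{\,4} \\
 &+ (z_{13}/m)[(z_{13}-1)/(m+1)]\{\bar p_3^{\,2}\sigma_{11}+\bar p_1[\bar p_1\sigma_{33}-2\bar p_3\sigma_{13}]\}/\bar p_{13}^{\,4} \\
 &+ 2(z_{12}/m)[z_{13}/(m+1)]\{\bar p_2[\bar p_3\sigma_{11}-\bar p_1\sigma_{13}]+\bar p_1[\bar p_1\sigma_{23}-\bar p_3\sigma_{12}]\}/(\bar p_{12}\bar p_{13})^2 \\
 &+ \{(z_{12}/m)\bar p_1\bar p_2/\bar p_{12}^{\,2}+(z_{13}/m)\bar p_1\bar p_3/\bar p_{13}^{\,2}+\bar p_1(1-\bar p_1)\}/(m+1)+\delta_{11}, \\[6pt]
\sigma_{12} ={}& z_{12}^2/[m(m+1)]\{-\bar p_2^{\,2}\sigma_{11}+2\bar p_1\bar p_2\sigma_{12}-\bar p_1^{\,2}\sigma_{22}\}/\bar p_{12}^{\,4} \\
 &+ z_{12}z_{23}/[m(m+1)]\{\bar p_2[\bar p_3\sigma_{12}-\bar p_2\sigma_{13}]+\bar p_1[\bar p_2\sigma_{23}-\bar p_3\sigma_{22}]\}/(\bar p_{12}\bar p_{23})^2 \\
 &+ z_{12}z_{13}/[m(m+1)]\{\bar p_3[\bar p_1\sigma_{12}-\bar p_2\sigma_{11}]+\bar p_1[\bar p_2\sigma_{13}-\bar p_1\sigma_{23}]\}/(\bar p_{12}\bar p_{13})^2 \\
 &+ z_{13}z_{23}/[m(m+1)]\{\bar p_3[\bar p_3\sigma_{12}-\bar p_2\sigma_{13}]+\bar p_1[\bar p_2\sigma_{33}-\bar p_3\sigma_{23}]\}/(\bar p_{13}\bar p_{23})^2 \\
 &- \{(z_{12}/m)[\bar p_1\bar p_2\bar p_{12}^{\,2}-\bar p_2^{\,2}\sigma_{11}+2\bar p_1\bar p_2\sigma_{12}-\bar p_1^{\,2}\sigma_{22}]/\bar p_{12}^{\,4}+\bar p_1\bar p_2\}/(m+1)+\delta_{12}, \\[6pt]
\sigma_{22} ={}& (z_{12}/m)[(z_{12}-1)/(m+1)]\{\bar p_1^{\,2}\sigma_{22}+\bar p_2[\bar p_2\sigma_{11}-2\bar p_1\sigma_{12}]\}/\bar p_{12}^{\,4} \\
 &+ (z_{23}/m)[(z_{23}-1)/(m+1)]\{\bar p_3^{\,2}\sigma_{22}+\bar p_2[\bar p_2\sigma_{33}-2\bar p_3\sigma_{23}]\}/\bar p_{23}^{\,4} \\
 &+ 2(z_{12}/m)[z_{23}/(m+1)]\{\bar p_1[\bar p_3\sigma_{22}-\bar p_2\sigma_{23}]+\bar p_2[\bar p_2\sigma_{13}-\bar p_3\sigma_{12}]\}/(\bar p_{12}\bar p_{23})^2 \\
 &+ \{(z_{12}/m)\bar p_1\bar p_2/\bar p_{12}^{\,2}+(z_{23}/m)\bar p_2\bar p_3/\bar p_{23}^{\,2}+\bar p_2(1-\bar p_2)\}/(m+1)+\delta_{22}.
\end{aligned}
\tag{3.28}
\]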


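To estimate the posterior covariance matrix by the iterative procedure,
we iterate on (3.27) until the convergence condition is met on, say, the
t-th iteration. Then, for $f_i=\tilde p_i^{(t)}$, $f_{ij}=\tilde p_{ij}^{(t)}$,
and $g_{ij}=f_i/f_{ij}$, and using $\sigma_{13}=-\sigma_{11}-\sigma_{12}$,
$\sigma_{23}=-\sigma_{12}-\sigma_{22}$, and
$\sigma_{33}=\sigma_{11}+2\sigma_{12}+\sigma_{22}$, we rewrite (3.28) as

\[
\begin{aligned}
\tilde\sigma_{11}^{(s+1)} ={}& z_{12}^2/[m(m+1)]\{f_2^2\tilde\sigma_{11}^{(s)}-2f_1f_2\tilde\sigma_{12}^{(s)}+f_1^2\tilde\sigma_{22}^{(s)}\}/f_{12}^4 \\
 &+ z_{13}^2/[m(m+1)]\{f_{13}^2\tilde\sigma_{11}^{(s)}+2f_1f_{13}\tilde\sigma_{12}^{(s)}+f_1^2\tilde\sigma_{22}^{(s)}\}/f_{13}^4 \\
 &+ 2z_{12}z_{13}/[m(m+1)]\{f_2f_{13}\tilde\sigma_{11}^{(s)}+f_1[f_2-f_{13}]\tilde\sigma_{12}^{(s)}-f_1^2\tilde\sigma_{22}^{(s)}\}/(f_{12}f_{13})^2 \\
 &- \{(z_{12}/m)[f_2^2\tilde\sigma_{11}^{(s)}-2f_1f_2\tilde\sigma_{12}^{(s)}+f_1^2\tilde\sigma_{22}^{(s)}-f_1f_2f_{12}^2]/f_{12}^4 \\
 &\qquad + (z_{13}/m)[f_{13}^2\tilde\sigma_{11}^{(s)}+2f_1f_{13}\tilde\sigma_{12}^{(s)}+f_1^2\tilde\sigma_{22}^{(s)}-f_1f_3f_{13}^2]/f_{13}^4 - f_1(1-f_1)\}/(m+1), \\[6pt]
\tilde\sigma_{12}^{(s+1)} ={}& z_{12}^2/[m(m+1)]\{-f_2^2\tilde\sigma_{11}^{(s)}+2f_1f_2\tilde\sigma_{12}^{(s)}-f_1^2\tilde\sigma_{22}^{(s)}\}/f_{12}^4 \\
 &+ z_{12}z_{13}/[m(m+1)]\{-f_2f_{13}\tilde\sigma_{11}^{(s)}+f_1[f_{13}-f_2]\tilde\sigma_{12}^{(s)}+f_1^2\tilde\sigma_{22}^{(s)}\}/(f_{12}f_{13})^2 \\
 &+ z_{12}z_{23}/[m(m+1)]\{f_2^2\tilde\sigma_{11}^{(s)}+f_2[f_{23}-f_1]\tilde\sigma_{12}^{(s)}-f_1f_{23}\tilde\sigma_{22}^{(s)}\}/(f_{12}f_{23})^2 \\
 &+ z_{13}z_{23}/[m(m+1)]\{f_2f_{13}\tilde\sigma_{11}^{(s)}+[f_{13}f_{23}+f_1f_2]\tilde\sigma_{12}^{(s)}+f_1f_{23}\tilde\sigma_{22}^{(s)}\}/(f_{13}f_{23})^2 \\
 &- \{(z_{12}/m)[-f_2^2\tilde\sigma_{11}^{(s)}+2f_1f_2\tilde\sigma_{12}^{(s)}-f_1^2\tilde\sigma_{22}^{(s)}+f_1f_2f_{12}^2]/f_{12}^4+f_1f_2\}/(m+1), \\[6pt]
\tilde\sigma_{22}^{(s+1)} ={}& z_{12}^2/[m(m+1)]\{f_2^2\tilde\sigma_{11}^{(s)}-2f_1f_2\tilde\sigma_{12}^{(s)}+f_1^2\tilde\sigma_{22}^{(s)}\}/f_{12}^4 \\
 &+ z_{23}^2/[m(m+1)]\{f_2^2\tilde\sigma_{11}^{(s)}+2f_2f_{23}\tilde\sigma_{12}^{(s)}+f_{23}^2\tilde\sigma_{22}^{(s)}\}/f_{23}^4 \\
 &+ 2z_{12}z_{23}/[m(m+1)]\{-f_2^2\tilde\sigma_{11}^{(s)}+f_2[f_1-f_{23}]\tilde\sigma_{12}^{(s)}+f_1f_{23}\tilde\sigma_{22}^{(s)}\}/(f_{12}f_{23})^2 \\
 &- \{(z_{12}/m)[f_2^2\tilde\sigma_{11}^{(s)}-2f_1f_2\tilde\sigma_{12}^{(s)}+f_1^2\tilde\sigma_{22}^{(s)}-f_1f_2f_{12}^2]/f_{12}^4 \\
 &\qquad + (z_{23}/m)[f_2^2\tilde\sigma_{11}^{(s)}+2f_2f_{23}\tilde\sigma_{12}^{(s)}+f_{23}^2\tilde\sigma_{22}^{(s)}-f_2f_3f_{23}^2]/f_{23}^4 - f_2(1-f_2)\}/(m+1),
\end{aligned}
\tag{3.29}
\]

where we calculate initial estimates $\tilde\sigma_{ii}^{(0)}$ and
$\tilde\sigma_{ij}^{(0)}$ from (3.21) and (3.22) as

\[
\begin{aligned}
\tilde\sigma_{11}^{(0)} &= [f_1(1-f_1)+(z_{12}/m)g_{12}(1-g_{12})+(z_{13}/m)g_{13}(1-g_{13})]/(m+1), \\
\tilde\sigma_{12}^{(0)} &= -[f_1f_2+(z_{12}/m)g_{12}g_{21}]/(m+1), \\
\tilde\sigma_{22}^{(0)} &= [f_2(1-f_2)+(z_{12}/m)g_{21}(1-g_{21})+(z_{23}/m)g_{23}(1-g_{23})]/(m+1).
\end{aligned}
\tag{3.30}
\]

After evaluating (3.30) we iterate on (3.29) until results converge.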
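
The starting values (3.30) are cheap to compute once the mean iteration has converged. The following Python sketch evaluates them; the function name and the numbers in the usage line are hypothetical, and the ratios $g_{ij}=f_i/f_{ij}$ are formed as described in the text.

```python
def initial_covariances(f, z12, z13, z23, m):
    """Starting values for the covariance iteration, following (3.30).

    f = (f1, f2, f3): converged approximate posterior means
    z12, z13, z23: partially classified counts; m as in (3.27).
    Returns (sigma11, sigma12, sigma22).
    """
    f1, f2, f3 = f
    g12, g21 = f1 / (f1 + f2), f2 / (f1 + f2)  # g_ij = f_i / f_ij
    g13 = f1 / (f1 + f3)
    g23 = f2 / (f2 + f3)
    s11 = (f1 * (1 - f1)
           + (z12 / m) * g12 * (1 - g12)
           + (z13 / m) * g13 * (1 - g13)) / (m + 1)
    s12 = -(f1 * f2 + (z12 / m) * g12 * g21) / (m + 1)
    s22 = (f2 * (1 - f2)
           + (z12 / m) * g21 * (1 - g21)
           + (z23 / m) * g23 * (1 - g23)) / (m + 1)
    return s11, s12, s22


# Hypothetical converged means and counts.
s11, s12, s22 = initial_covariances((0.5, 0.3, 0.2), z12=3, z13=2, z23=1, m=18)
```

With no partially classified observations the three values reduce to the complete-data expressions (3.18) and (3.19), which gives a quick sanity check on an implementation.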

To estimate the posterior covariance matrix by the noniterative
procedure, we substitute in equations (3.28) for $\sigma_{13}$,
$\sigma_{23}$, and $\sigma_{33}$ in terms of $\sigma_{11}$, $\sigma_{12}$,
and $\sigma_{22}$, collect terms, and rewrite equations (3.28) as

\[
\begin{aligned}
&-[f_1(1-f_1)+(z_{12}/m)g_{12}g_{21}+(z_{13}/m)g_{13}g_{31}]/(m+1)+\tau_{11} \\
&\quad= \sigma_{11}\{-1+[(z_{12}g_{21}/f_{12}+z_{13}/f_{13})^2-z_{12}(g_{21}/f_{12})^2-z_{13}/f_{13}^2]/[m(m+1)]\} \\
&\qquad+ 2\sigma_{12}[(z_{13}g_{13}/f_{13}-z_{12}g_{12}/f_{12})(z_{12}g_{21}/f_{12}+z_{13}/f_{13})+z_{12}g_{12}g_{21}/f_{12}^2-z_{13}g_{13}/f_{13}^2]/[m(m+1)] \\
&\qquad+ \sigma_{22}[(z_{12}g_{12}/f_{12}-z_{13}g_{13}/f_{13})^2-z_{12}(g_{12}/f_{12})^2-z_{13}(g_{13}/f_{13})^2]/[m(m+1)], \\[6pt]
&f_1f_2[1+z_{12}/(mf_{12}^2)]/(m+1)+\tau_{12} \\
&\quad= \sigma_{11}[(z_{12}g_{21}/f_{12}+z_{13}/f_{13})(z_{23}g_{23}/f_{23}-z_{12}g_{21}/f_{12})+z_{12}(g_{21}/f_{12})^2]/[m(m+1)] \\
&\qquad+ \sigma_{12}\{-1+[(z_{12}f_1/f_{12}^2+z_{23}(1-2f_1)/f_{23}^2)(z_{12}f_2/f_{12}^2+z_{13}(1-2f_2)/f_{13}^2) \\
&\qquad\qquad+ z_{12}(z_{12}-2)g_{12}g_{21}/f_{12}^2+z_{13}z_{23}(f_{12}-2f_1f_2)/(f_{13}f_{23})^2]/[m(m+1)]\} \\
&\qquad+ \sigma_{22}[(z_{12}g_{12}/f_{12}+z_{23}/f_{23})(z_{13}g_{13}/f_{13}-z_{12}g_{12}/f_{12})+z_{12}(g_{12}/f_{12})^2]/[m(m+1)], \\[6pt]
&-[f_2(1-f_2)+(z_{12}/m)g_{12}g_{21}+(z_{23}/m)g_{23}g_{32}]/(m+1)+\tau_{22} \\
&\quad= \sigma_{11}[(z_{12}g_{21}/f_{12}-z_{23}g_{23}/f_{23})^2-z_{12}(g_{21}/f_{12})^2-z_{23}(g_{23}/f_{23})^2]/[m(m+1)] \\
&\qquad+ 2\sigma_{12}[(-z_{12}g_{12}/f_{12}-z_{23}/f_{23})(z_{12}g_{21}/f_{12}-z_{23}g_{23}/f_{23})+z_{12}g_{12}g_{21}/f_{12}^2-z_{23}g_{23}/f_{23}^2]/[m(m+1)] \\
&\qquad+ \sigma_{22}\{-1+[(z_{12}g_{12}/f_{12}+z_{23}/f_{23})^2-z_{12}(g_{12}/f_{12})^2-z_{23}/f_{23}^2]/[m(m+1)]\},
\end{aligned}
\tag{3.31}
\]

for $\tau_{ij}=\delta_{ij}$ plus the error made from approximating
$\bar p_i$ by $f_i$.

Dropping the error terms $\tau_{ij}$, we have that equations (3.31) are
three equations linear in the approximations $\tilde\sigma_{11}$,
$\tilde\sigma_{12}$, and $\tilde\sigma_{22}$ of the posterior covariances
$\sigma_{11}$, $\sigma_{12}$, and $\sigma_{22}$, respectively. That is, we
approximate elements of the posterior covariance matrix by

\[
\tilde\sigma = \tilde A^{-1}\tilde B, \tag{3.32}
\]

where $\tilde\sigma=(\tilde\sigma_{11},\tilde\sigma_{12},\tilde\sigma_{22})'$,
$\tilde A$ is the 3x3 coefficient matrix of $\sigma$ from the right-hand
side of (3.31), and $\tilde B$ is the 3x1 column vector of constants
given on the left-hand side of (3.31).
3.4 Numerical Example:

We now compare the Taylor-series approximations for elements of the
posterior mean and covariance matrices with approximations given by the
maximum likelihood estimate and by the posterior mode on a small sample.
We use the example given in Section 2.2.3. Initial estimates for the
iterative algorithms were $\tilde p_1^{(0)}=1/4$ and
$\tilde p_2^{(0)}=\tilde p_3^{(0)}=3/8$. The condition for convergence was
that the absolute relative difference
$|\tilde p_i^{(s+1)}-\tilde p_i^{(s)}|/\tilde p_i^{(s)}$ be less than 0.001,
where $\tilde p_i^{(s)}$ denotes the s-th iteration of approximation
$\tilde p_i$. Because a uniform prior was used, the posterior mode equals
the maximum likelihood estimate.

Results from these approximations are given in the following Table
3.1. The Taylor-series approximations are by far the better
approximations for elements of the posterior mean and covariance
matrices. Further, they are excellent approximations for such a small
sample. For example, values of the Taylor-series approximate posterior
mean differ from the three corresponding elements of the exact posterior
mean by only 0.3%, 0.1%, and 0.1% in percentage absolute relative
difference $100\times|\tilde p_i-\bar p_i|/\bar p_i$. Corresponding
percentage absolute relative differences for the maximum likelihood
estimate (= posterior mode) are 9.7%, 3.8%, and 2.4%, respectively.
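
The two measures used in this example, the stopping rule and the percentage absolute relative difference, can be written as small helper functions. The sketch below is illustrative; the function names and the numbers in the usage lines are hypothetical and are not taken from Table 3.1.

```python
def relative_difference(new, old):
    """Absolute relative difference |new - old| / |old|."""
    return abs(new - old) / abs(old)


def has_converged(new_estimates, old_estimates, tol=0.001):
    """True when every element changed by less than tol in relative terms."""
    return all(
        relative_difference(new, old) < tol
        for new, old in zip(new_estimates, old_estimates)
    )


def percent_relative_difference(approx, exact):
    """Percentage absolute relative difference 100 * |approx - exact| / |exact|."""
    return 100.0 * abs(approx - exact) / abs(exact)


# Hypothetical successive iterates and a hypothetical approximation error.
done = has_converged([0.2501, 0.3750, 0.3749], [0.2500, 0.3750, 0.3750])
error_pct = percent_relative_difference(0.503, 0.500)
```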
Note to Table 3.1: Since a uniform prior is used, the maximum likelihood estimate also equals the posterior mode.
3.5 Summary:

In this chapter we considered three different approximations for
elements of the posterior mean vector and one approximation for elements
of the posterior covariance matrix. The maximum likelihood estimate (2.36)
and posterior mode (2.43) were considered because they asymptotically equal
the limiting posterior mean. However, as discussed in the first section,
there are problems with using these two estimates to approximate the
posterior mean. We then derived an approximation by conditioning twice
from the complete-data posterior mean and using Taylor-series expansions
for the unknown terms. An important property of the resulting Taylor-
series approximation is that as the percentage of incomplete data goes to
zero, the approximation goes to the exact posterior mean. Neither the
maximum likelihood estimate nor the posterior mode has this property. The
Taylor-series approximation also relates to the posterior mode (2.43) in
the same manner that the complete-data posterior mean relates to the
complete-data posterior mode. Because the Taylor-series approximation is
thus a posterior mode (for $\beta_i=\nu_i+1$), we were able to solve its
nonlinear system of equations by the EM algorithm discussed in Section
2.3.2. Approximations for the posterior mean and their complete-data
counterparts are given in the following Table 3.2.

The same approach of conditioning and using Taylor-series expansions
was also used to derive approximations for elements of the posterior
covariance matrix. The resulting approximations also have the important
property that as the percentage of incomplete data goes to zero, the
approximations go to the exact elements of the posterior covariance
matrix. We showed how to solve the system of equations from the
approximations either iteratively or noniteratively, once the posterior
mean has been approximated.

We illustrated the Taylor-series approximations algebraically for the
trinomial distribution and then compared them numerically with the maximum
likelihood estimate and the posterior mode for a uniform prior on a small
sample.
APPENDIX 3A

CONDITIONAL DERIVATIONS FOR TAYLOR-SERIES APPROXIMATIONS 
FOR POSTERIOR COVARIANCE MATRIX 


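3A.1 Posterior Variance:

For $1\le i\le k$ and Q, like D, denoting a set containing more than one
element,

\[
\begin{aligned}
\operatorname{var}(p_i\mid z)
 &= \mathrm{E}_{u|z}[\operatorname{var}(p_i\mid z,u)] + \operatorname{var}_{u|z}[\mathrm{E}(p_i\mid z,u)] \\
 &= \mathrm{E}_{u|z}\Big\{[m^2(m+1)]^{-1}\Big[z_{\{i\}}+\sum_{D\ni i}z_D^{(i)}+\nu_i\Big]\Big[m-\Big(z_{\{i\}}+\sum_{D\ni i}z_D^{(i)}+\nu_i\Big)\Big]\Big\}
   + m^{-2}\operatorname{var}_{u|z}\Big[z_{\{i\}}+\sum_{D\ni i}z_D^{(i)}+\nu_i\Big] \\
 &= \Big\{[m-(z_{\{i\}}+\nu_i)]\Big[(z_{\{i\}}+\nu_i)+\sum_{D\ni i}\mathrm{E}(z_D^{(i)}\mid z)\Big]
   - (z_{\{i\}}+\nu_i)\sum_{D\ni i}\mathrm{E}(z_D^{(i)}\mid z) \\
 &\qquad - \mathrm{E}\Big[\Big(\sum_{D\ni i}z_D^{(i)}\Big)^2\,\Big|\,z\Big]
   + (m+1)\operatorname{var}\Big[\sum_{D\ni i}z_D^{(i)}\,\Big|\,z\Big]\Big\}\Big/[m^2(m+1)] \\
 &= \Big(\Big\{m-\Big[z_{\{i\}}+\nu_i+\sum_{D\ni i}z_D\mathrm{E}_{p|z}(r_{iD}\mid z)\Big]\Big\}
   \Big\{z_{\{i\}}+\nu_i+\sum_{D\ni i}z_D\mathrm{E}_{p|z}(r_{iD}\mid z)\Big\} \\
 &\qquad - \sum_{D\ni i}\Big\{z_D\mathrm{E}_{p|z}[r_{iD}(1-r_{iD})\mid z]+z_D^2\operatorname{var}_{p|z}(r_{iD}\mid z)
   + \sum_{Q\ni i,\,Q\ne D}z_Dz_Q\operatorname{cov}_{p|z}(r_{iD},r_{iQ}\mid z)\Big\} \\
 &\qquad + (m+1)\sum_{D\ni i}\Big\{z_D\mathrm{E}_{p|z}[r_{iD}(1-r_{iD})\mid z]+z_D^2\operatorname{var}_{p|z}(r_{iD}\mid z)
   + \sum_{Q\ni i,\,Q\ne D}z_Dz_Q\operatorname{cov}_{p|z}(r_{iD},r_{iQ}\mid z)\Big\}\Big)\Big/[m^2(m+1)],
\end{aligned}
\tag{3A.1}
\]

because

\[
\mathrm{E}(z_D^{(i)}\mid z) = \mathrm{E}_{p|z}[\mathrm{E}(z_D^{(i)}\mid p,z)] = z_D\,\mathrm{E}_{p|z}(r_{iD}\mid z) \tag{3A.2}
\]

and

\[
\begin{aligned}
\mathrm{E}\Big[\Big(\sum_{D\ni i}z_D^{(i)}\Big)^2\,\Big|\,z\Big]
 &= \sum_{D\ni i}\Big\{\mathrm{E}[(z_D^{(i)})^2\mid z]+\sum_{Q\ni i,\,Q\ne D}\mathrm{E}(z_D^{(i)}z_Q^{(i)}\mid z)\Big\} \\
 &= \sum_{D\ni i}\Big(\operatorname{var}(z_D^{(i)}\mid z)+[\mathrm{E}(z_D^{(i)}\mid z)]^2
   + \sum_{Q\ni i,\,Q\ne D}\{\operatorname{cov}(z_D^{(i)},z_Q^{(i)}\mid z)+[\mathrm{E}(z_D^{(i)}\mid z)][\mathrm{E}(z_Q^{(i)}\mid z)]\}\Big) \\
 &= \sum_{D\ni i}\Big(\mathrm{E}_{p|z}[\operatorname{var}(z_D^{(i)}\mid z,p)]+\operatorname{var}_{p|z}[\mathrm{E}(z_D^{(i)}\mid z,p)]
   + \{\mathrm{E}_{p|z}[\mathrm{E}(z_D^{(i)}\mid z,p)]\}^2 \\
 &\qquad + \sum_{Q\ni i,\,Q\ne D}\Big(\mathrm{E}_{p|z}[\operatorname{cov}(z_D^{(i)},z_Q^{(i)}\mid z,p)]
   + \operatorname{cov}_{p|z}[\mathrm{E}(z_D^{(i)}\mid z,p),\mathrm{E}(z_Q^{(i)}\mid z,p)] \\
 &\qquad\qquad + \{\mathrm{E}_{p|z}[\mathrm{E}(z_D^{(i)}\mid z,p)]\}\{\mathrm{E}_{p|z}[\mathrm{E}(z_Q^{(i)}\mid z,p)]\}\Big)\Big) \\
 &= \sum_{D\ni i}\Big\{z_D\mathrm{E}_{p|z}[r_{iD}(1-r_{iD})\mid z]+z_D^2\operatorname{var}_{p|z}(r_{iD}\mid z)+z_D^2[\mathrm{E}_{p|z}(r_{iD}\mid z)]^2 \\
 &\qquad + \sum_{Q\ni i,\,Q\ne D}[z_Dz_Q\operatorname{cov}_{p|z}(r_{iD},r_{iQ}\mid z)+z_Dz_Q\mathrm{E}_{p|z}(r_{iD}\mid z)\mathrm{E}_{p|z}(r_{iQ}\mid z)]\Big\} \\
 &= \sum_{D\ni i}\Big\{z_D\mathrm{E}_{p|z}[r_{iD}(1-r_{iD})\mid z]+z_D^2\operatorname{var}_{p|z}(r_{iD}\mid z)
   + \sum_{Q\ni i,\,Q\ne D}z_Dz_Q\operatorname{cov}_{p|z}(r_{iD},r_{iQ}\mid z)\Big\}
   + \Big[\sum_{D\ni i}z_D\mathrm{E}_{p|z}(r_{iD}\mid z)\Big]^2
\end{aligned}
\tag{3A.3}
\]

and

\[
\begin{aligned}
\operatorname{var}\Big(\sum_{D\ni i}z_D^{(i)}\,\Big|\,z\Big)
 &= \sum_{D\ni i}\Big[\operatorname{var}(z_D^{(i)}\mid z)+\sum_{Q\ni i,\,Q\ne D}\operatorname{cov}(z_D^{(i)},z_Q^{(i)}\mid z)\Big] \\
 &= \sum_{D\ni i}\Big(\mathrm{E}_{p|z}[\operatorname{var}(z_D^{(i)}\mid z,p)]+\operatorname{var}_{p|z}[\mathrm{E}(z_D^{(i)}\mid z,p)] \\
 &\qquad + \sum_{Q\ni i,\,Q\ne D}\{\mathrm{E}_{p|z}[\operatorname{cov}(z_D^{(i)},z_Q^{(i)}\mid z,p)]
   + \operatorname{cov}_{p|z}[\mathrm{E}(z_D^{(i)}\mid z,p),\mathrm{E}(z_Q^{(i)}\mid z,p)]\}\Big) \\
 &= \sum_{D\ni i}\Big\{z_D\mathrm{E}_{p|z}[r_{iD}(1-r_{iD})\mid z]+z_D^2\operatorname{var}_{p|z}(r_{iD}\mid z)
   + \sum_{Q\ni i,\,Q\ne D}z_Dz_Q\operatorname{cov}_{p|z}(r_{iD},r_{iQ}\mid z)\Big\},
\end{aligned}
\tag{3A.4}
\]

since $\operatorname{cov}(z_D^{(i)},z_Q^{(i)}\mid z,p)=0$ for $Q\ne D$.

Therefore, combining terms in (3A.1) and recalling (3.6),

\[
\begin{aligned}
\operatorname{var}(p_i\mid z) ={}& \sum_{D\ni i}\Big\{z_D^2/[m(m+1)]\operatorname{var}_{p|z}(r_{iD}\mid z)
   + \sum_{Q\ni i,\,Q\ne D}z_Dz_Q/[m(m+1)]\operatorname{cov}_{p|z}(r_{iD},r_{iQ}\mid z)\Big\} \\
 &+ \Big\{\mathrm{E}(p_i\mid z)[1-\mathrm{E}(p_i\mid z)]
   + \sum_{D\ni i}(z_D/m)\,\mathrm{E}_{p|z}[r_{iD}(1-r_{iD})\mid z]\Big\}\Big/(m+1).
\end{aligned}
\tag{3A.5}
\]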
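3A.2 Posterior Covariance:

Finally, for $1\le i,h\le k$, $i\ne h$, and Q defined as in 3A.1, we have
that

\[
\begin{aligned}
\operatorname{cov}(p_i,p_h\mid z)
 &= \mathrm{E}_{u|z}[\operatorname{cov}(p_i,p_h\mid z,u)] + \operatorname{cov}_{u|z}[\mathrm{E}(p_i\mid z,u),\mathrm{E}(p_h\mid z,u)] \\
 &= -[m^2(m+1)]^{-1}\,\mathrm{E}_{u|z}\Big\{\Big[z_{\{i\}}+\sum_{D\ni i}z_D^{(i)}+\nu_i\Big]\Big[z_{\{h\}}+\sum_{Q\ni h}z_Q^{(h)}+\nu_h\Big]\Big\} \\
 &\qquad + m^{-2}\operatorname{cov}_{u|z}\Big[z_{\{i\}}+\sum_{D\ni i}z_D^{(i)}+\nu_i,\ z_{\{h\}}+\sum_{Q\ni h}z_Q^{(h)}+\nu_h\Big] \\
 &= -\Big\{(z_{\{h\}}+\nu_h)\sum_{D\ni i}z_D\mathrm{E}_{p|z}(r_{iD}\mid z)
   + (z_{\{i\}}+\nu_i)\Big[z_{\{h\}}+\nu_h+\sum_{Q\ni h}z_Q\mathrm{E}_{p|z}(r_{hQ}\mid z)\Big] \\
 &\qquad - \sum_{D\ni i,h}z_D\mathrm{E}_{p|z}(r_{iD}r_{hD}\mid z)
   + \sum_{D\ni i}\sum_{Q\ni h}z_Dz_Q[\operatorname{cov}_{p|z}(r_{iD},r_{hQ}\mid z)+\mathrm{E}_{p|z}(r_{iD}\mid z)\mathrm{E}_{p|z}(r_{hQ}\mid z)] \\
 &\qquad - (m+1)\Big[\sum_{D\ni i}\sum_{Q\ni h}z_Dz_Q\operatorname{cov}_{p|z}(r_{iD},r_{hQ}\mid z)
   - \sum_{D\ni i,h}z_D\mathrm{E}_{p|z}(r_{iD}r_{hD}\mid z)\Big]\Big\}\Big/[m^2(m+1)],
\end{aligned}
\tag{3A.6}
\]

because

\[
\begin{aligned}
\mathrm{E}_{u|z}\Big[\Big(\sum_{D\ni i}z_D^{(i)}\Big)\Big(\sum_{Q\ni h}z_Q^{(h)}\Big)\Big]
 &= \sum_{D\ni i}\sum_{Q\ni h}\mathrm{E}(z_D^{(i)}z_Q^{(h)}\mid z) \\
 &= \sum_{D\ni i}\sum_{Q\ni h}\Big\{\mathrm{E}_{p|z}[\operatorname{cov}(z_D^{(i)},z_Q^{(h)}\mid z,p)]
   + \operatorname{cov}_{p|z}[\mathrm{E}(z_D^{(i)}\mid z,p),\mathrm{E}(z_Q^{(h)}\mid z,p)] \\
 &\qquad + \mathrm{E}_{p|z}[\mathrm{E}(z_D^{(i)}\mid z,p)]\,\mathrm{E}_{p|z}[\mathrm{E}(z_Q^{(h)}\mid z,p)]\Big\} \\
 &= -\sum_{D\ni i,h}z_D\mathrm{E}_{p|z}(r_{iD}r_{hD}\mid z)
   + \sum_{D\ni i}\sum_{Q\ni h}z_Dz_Q\{\operatorname{cov}_{p|z}(r_{iD},r_{hQ}\mid z)+\mathrm{E}_{p|z}(r_{iD}\mid z)\mathrm{E}_{p|z}(r_{hQ}\mid z)\}
\end{aligned}
\tag{3A.7}
\]

and

\[
\begin{aligned}
\operatorname{cov}_{u|z}\Big(\sum_{D\ni i}z_D^{(i)},\sum_{Q\ni h}z_Q^{(h)}\Big)
 &= \sum_{D\ni i}\sum_{Q\ni h}\{\mathrm{E}_{p|z}[\operatorname{cov}(z_D^{(i)},z_Q^{(h)}\mid z,p)]
   + \operatorname{cov}_{p|z}[\mathrm{E}(z_D^{(i)}\mid z,p),\mathrm{E}(z_Q^{(h)}\mid z,p)]\} \\
 &= -\sum_{D\ni i,h}z_D\mathrm{E}_{p|z}(r_{iD}r_{hD}\mid z)
   + \sum_{D\ni i}\sum_{Q\ni h}z_Dz_Q\operatorname{cov}_{p|z}(r_{iD},r_{hQ}\mid z).
\end{aligned}
\tag{3A.8}
\]

Therefore, combining terms in (3A.6) and recalling (3.6),

\[
\begin{aligned}
\operatorname{cov}(p_i,p_h\mid z) ={}& \sum_{D\ni i}\sum_{Q\ni h}(z_D/m)[z_Q/(m+1)]\operatorname{cov}_{p|z}(r_{iD},r_{hQ}\mid z) \\
 &- \Big[\sum_{D\ni i,h}(z_D/m)\,\mathrm{E}_{p|z}(r_{iD}r_{hD}\mid z)+\mathrm{E}(p_i\mid z)\mathrm{E}(p_h\mid z)\Big]\Big/(m+1).
\end{aligned}
\tag{3A.9}
\]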
APPENDIX 3B

APPROXIMATIONS OF RATIOS AND THEIR MOMENTS

3B.1 Introduction:

For $i\in D$, consider the ratio $r_{iD}=p_i/p_D$ for
$p_D=\sum_{j\in D}p_j$. Define $e_j=[p_j-\mathrm{E}(p_j\mid z)]\mid z$ and
let $e_D$ be the vector of $e_j$ for $j\in D$. Let
$\mathrm{E}(p\mid z)=(\mathrm{E}(p_1\mid z),\ldots,\mathrm{E}(p_k\mid z))$.
Define $D\setminus i$ to be D minus the integer i. Let
$d_D(w_i,\{w_j\}_{j\in D\setminus i})$ be a function of a vector argument
of dimension equal to the number of integers in D and be defined by

\[
d_D(w_i,\{w_j\}_{j\in D\setminus i}) = w_i\Big/\sum_{j\in D}w_j. \tag{3B.1}
\]

[Thus, for D={1,2,3} and i=1, $D\setminus i$={2,3} and
$d_D(w_1,\{w_j\}_{j\in D\setminus i})=d_D(w_1,w_2,w_3)=w_1/(w_1+w_2+w_3)$.]
Then, for $l\in D$,

\[
\partial d_D(w_i,\{w_j\}_{j\in D\setminus i})/\partial w_l =
\begin{cases}
\sum_{j\in D\setminus i}w_j\Big/\Big(\sum_{j\in D}w_j\Big)^2 & \text{for } l=i, \\[4pt]
-w_i\Big/\Big(\sum_{j\in D}w_j\Big)^2 & \text{for } l\ne i.
\end{cases}
\tag{3B.2}
\]
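
The partial derivatives in (3B.2) are easy to check numerically. The Python sketch below evaluates the ratio (3B.1) and its gradient and is meant only as an illustration; the function names and the weights in the usage line are hypothetical, and positions in the list play the role of the integers in D.

```python
def ratio(w, i):
    """The ratio d_D of (3B.1): element i divided by the sum over D."""
    return w[i] / sum(w)


def ratio_gradient(w, i):
    """Partial derivatives of the ratio with respect to each w_l, as in (3B.2)."""
    total = sum(w)
    rest = total - w[i]  # sum over D minus the integer i
    return [
        rest / total ** 2 if l == i else -w[i] / total ** 2
        for l in range(len(w))
    ]


# Hypothetical weights for D = {1, 2, 3} with i = 1 (list position 0).
w = [0.2, 0.3, 0.5]
gradient = ratio_gradient(w, 0)
```

Because the ratio is unchanged when all weights are scaled by a common factor, the gradient is orthogonal to w, which offers a second check besides finite differences.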


To characterize errors in the ratio approximations, we define the
Landau symbols O and o and their stochastic parallels $O_p$ and $o_p$.
[See Bishop, Fienberg, and Holland (1975, chpt. 14), Cox and Hinkley
(1974, chpt. 9), Cramer (1951, chpt. 12), and Schmetterer (1974, p. 17).]
Let $\|y\|$ denote the length $(\sum_{i=1}^{k+1}y_i^2)^{1/2}$ of the
k-dimensional vector y.

Definition 3B.1: For $\{a_n\}$ a sequence of real numbers or vectors and
$\{b_n\}$ a sequence of positive real numbers

a. $a_n=O(b_n)$ if there exists a number K and an integer n(K) such that
if n exceeds n(K) then $\|a_n\|<Kb_n$;

b. $a_n=o(b_n)$ if for every $\epsilon>0$, there exists an integer
$n(\epsilon)$ such that if n exceeds $n(\epsilon)$ then
$\|a_n\|<\epsilon b_n$.

Definition 3B.2: For a(x) and b(x) continuous functions of the real
number or vector x,

a. $a(x)=O(b(x))$ as $x\to y$ if, for any sequence $\{x_n\}$ such that
$x_n\to y$, $a(x_n)=O(b(x_n))$;

b. $a(x)=o(b(x))$ as $x\to y$ if, for any sequence $\{x_n\}$ such that
$x_n\to y$, $a(x_n)=o(b(x_n))$.

Definition 3B.3: For random variable, or vector of random variables,
$v_n$ and sequence $\{a_n\}$ of positive real numbers

a. $v_n=O_p(a_n)$ if for every $\eta>0$ there exists a constant
$K(\eta)$ and an integer $n(\eta)$ such that if $n\ge n(\eta)$, then
$P\{\|v_n\|/a_n\le K(\eta)\}\ge 1-\eta$;

b. $v_n=o_p(a_n)$ if for every $\epsilon>0$,
$\lim_{n\to\infty}P\{\|v_n\|/a_n\le\epsilon\}=1$.

Lemma 3B.1: For O, o, $O_p$, and $o_p$ as just defined:

a. For the nonzero constant c, $O(cx_n)=O(x_n)$ and $o(cx_n)=o(x_n)$.

b. $O(o(x_n))=o(x_n)$; $o(O(x_n))=o(x_n)$; $O(O(x_n))=O(x_n)$; and
$o(o(x_n))=o(x_n)$;

c. $o(x_n)+O(y_n)=O(\|x_n\|+\|y_n\|)$; $o(x_n)O(y_n)=o(x_ny_n)$; and
$O(x_n)O(y_n)=O(x_ny_n)$;

d. $x_n=O(a_n^{-j})$ implies that $x_n=o(a_n^{-j+1/2})$ but
$x_n=o(a_n^{-j+1/2})$ does not imply that $x_n=O(a_n^{-j})$. [For
example, let $x_n=c/n^{3/4}$ for c a constant.]; and

e. a. through d. hold if O is replaced by $O_p$ and/or o by $o_p$ with
the exception that if a subscript p appears anywhere on the left-hand
side of an equality in a. through d., then a subscript p must also
appear on the right-hand side.

To justify results from calculating the expected value of the error 
terms, we have that 


Lemma 3B.2 : for j an even integer, 

|E{ n (p h -p h )|z}| * | E { n (p h -p h ) | z} | = 0(n“ j/2 ) 

9=1 99 9=1 9 9 

where again p h =E ( p. |z). 

9 9 ~ 

A proof of Lemma 3B.2 is given in Chapter 4. Thus, Lemma 3B.2 gives the
magnitude of elements of the posterior covariance matrix and proves that
posterior central cross-product moments significantly decrease as their
order increases. Therefore, Taylor-series approximations in this
appendix are valid.

From Definition 3B.2, we can write the first-order Taylor-series
expansion of $d_D(p_i\mid z,\{p_j\mid z\}_{j\in D\setminus i})=r_{iD}\mid z$
about the value

\[
d_D(\mathrm{E}(p_i\mid z),\{\mathrm{E}(p_j\mid z)\}_{j\in D\setminus i})
 = \mathrm{E}(p_i\mid z)/\mathrm{E}(p_D\mid z)
 = \mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)
\]

as

\[
r_{iD}\mid z = \mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)
 + e_D\,[\partial d_D/\partial\mathrm{E}(p\mid z)]' + o(\|e_D\|) \tag{3B.3}
\]

as $p\mid z\to\mathrm{E}(p\mid z)$, for
$[\partial d_D/\partial\mathrm{E}(p\mid z)]'$ denoting the transposed
vector of $\partial d_D(w_i,\{w_j\}_{j\in D\setminus i})/\partial w_l$, for
$l\in D$, evaluated at $w=\mathrm{E}(p\mid z)$. That is, for $l\in D$,

\[
\partial d_D/\partial\mathrm{E}(p_l\mid z) =
\begin{cases}
\sum_{j\in D\setminus i}\mathrm{E}(p_j\mid z)\Big/\Big[\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]^2 & \text{for } l=i, \\[4pt]
-\mathrm{E}(p_i\mid z)\Big/\Big[\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]^2 & \text{for } l\ne i.
\end{cases}
\tag{3B.4}
\]

By Tchebychev's inequality and the definition of $O_p$,

\[
p_i\mid z-\mathrm{E}(p_i\mid z) = O_p([\operatorname{var}(p_i\mid z)]^{1/2}). \tag{3B.5}
\]

From Lemma 3B.2 we have that $\operatorname{var}(p_i\mid z)=O(n^{-1})$.
Therefore, by Lemma 3B.1, the error term in the first-order
Taylor-series approximation (3B.3) of $r_{iD}\mid z$ is

\[
\begin{aligned}
o(\|e_D\|) &= o\Big(\Big[\sum_{j\in D}e_j^2\Big]^{1/2}\Big) \\
 &= o[O_p(n^{-1/2})] \\
 &= o_p(n^{-1/2}).
\end{aligned}
\tag{3B.6}
\]

Because we know the magnitude of $e_D$, we can also write (3B.6) as

\[
o(\|e_D\|) = O_p(n^{-1}). \tag{3B.7}
\]

Recalling from Lemma 3B.2 that the expected value of the error term
with respect to the posterior distribution of p given z is small
relative to the first-order terms, we approximate moments of each ratio
$r_{iD}\mid z$ by calculating expected values of the left- and right-hand
sides of each $r_{iD}\mid z$ Taylor-series approximation.

Recall that $\Sigma=(\sigma_{ij})$ is the posterior covariance matrix of
p given z. Let $\Sigma_D$ denote that portion of $\Sigma$ that pertains
to $j\in D$. That is, $\Sigma_D$ is the matrix that results from deleting
from $\Sigma$ all l-th rows and columns for $l\notin D$.
3B.2 Posterior Mean:

Then, since $\mathrm{E}(e_D)=0$ and $\mathrm{E}[e_D'e_D]=\Sigma_D$, we
have from (3B.3), (3B.6), and (3B.7) that

\[
r_{iD}\mid z = \mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)
 + e_D\,[\partial d_D/\partial\mathrm{E}(p\mid z)]' + o_p(n^{-1/2}) \tag{3B.8}
\]

and

\[
\begin{aligned}
\mathrm{E}(r_{iD}\mid z) &= \mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z) + o(n^{-1/2}) \\
 &= \mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z) + O(n^{-1}).
\end{aligned}
\tag{3B.9}
\]

Note that we can write $O(n^{-1})$ in the last line of (3B.9) because
$o(n^{-1/2})$ in the first line comes from an $n^{-1}$ term. [Recall
(3B.5) - (3B.7).]
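3B.3 Posterior Variance:

Similarly,

\[
\begin{aligned}
r_{iD}^2\mid z ={}& \Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]^2
 + [\partial d_D/\partial\mathrm{E}(p\mid z)]\,e_D'e_D\,[\partial d_D/\partial\mathrm{E}(p\mid z)]' \\
 &+ 2\Big\{\Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)+o_p(n^{-1/2})\Big]e_D[\partial d_D/\partial\mathrm{E}(p\mid z)]' \\
 &\qquad + \Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]o_p(n^{-1/2})\Big\} + o_p(n^{-1}),
\end{aligned}
\tag{3B.10}
\]

so that

\[
\begin{aligned}
\mathrm{E}(r_{iD}^2\mid z) ={}& \Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]^2
 + [\partial d_D/\partial\mathrm{E}(p\mid z)]\,\Sigma_D\,[\partial d_D/\partial\mathrm{E}(p\mid z)]' \\
 &+ 2\Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]o(n^{-1/2}) + o(n^{-1})
\end{aligned}
\tag{3B.11}
\]

or

\[
\begin{aligned}
\mathrm{E}(r_{iD}^2\mid z) ={}& \Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]^2
 + \sum_{j\in D}[\partial d_D/\partial\mathrm{E}(p_j\mid z)]^2\sigma_{jj} \\
 &+ 2\sum_{j\in D}\sum_{l\in D,\ l>j}[\partial d_D/\partial\mathrm{E}(p_j\mid z)][\partial d_D/\partial\mathrm{E}(p_l\mid z)]\sigma_{jl}
 + O(n^{-1}).
\end{aligned}
\tag{3B.12}
\]

For use in Chapter 4, substitution from (3B.4) into (3B.12) yields
that

\[
\begin{aligned}
\mathrm{E}(r_{iD}^2\mid z) ={}& \Big\{2\mathrm{E}(p_i\mid z)\Big[-\mathrm{E}(p_{D\setminus i}\mid z)\sum_{j\in D\setminus i}\sigma_{ij}
 + \mathrm{E}(p_i\mid z)\sum_{j\in D\setminus i}\sum_{l\in D\setminus i,\ l>j}\sigma_{jl}\Big] \\
 &\quad + [\mathrm{E}(p_{D\setminus i}\mid z)]^2\sigma_{ii}
 + [\mathrm{E}(p_i\mid z)]^2\sum_{j\in D\setminus i}\sigma_{jj}\Big\}\Big/[\mathrm{E}(p_D\mid z)]^4 \\
 &+ [\mathrm{E}(p_i\mid z)/\mathrm{E}(p_D\mid z)]^2 + O(n^{-1}).
\end{aligned}
\tag{3B.13}
\]

From (3B.9) we have that

\[
\begin{aligned}
[\mathrm{E}(r_{iD}\mid z)]^2 ={}& \Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]^2
 + 2\Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]o(n^{-1/2}) \\
 &+ o(n^{-1})
\end{aligned}
\tag{3B.14}
\]

or

\[
[\mathrm{E}(r_{iD}\mid z)]^2 = \Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]^2 + O(n^{-1}). \tag{3B.15}
\]

Therefore, from equations (3B.11) and (3B.14) we have that

\[
\begin{aligned}
\operatorname{var}(r_{iD}\mid z) &= [\partial d_D/\partial\mathrm{E}(p\mid z)]\,\Sigma_D\,[\partial d_D/\partial\mathrm{E}(p\mid z)]' + o(n^{-1}) \\
 &= [\partial d_D/\partial\mathrm{E}(p\mid z)]\,\Sigma_D\,[\partial d_D/\partial\mathrm{E}(p\mid z)]' + O(n^{-3/2}) \\
 &= \sum_{j\in D}[\partial d_D/\partial\mathrm{E}(p_j\mid z)]^2\sigma_{jj}
 + 2\sum_{j\in D}\sum_{l\in D,\ l>j}[\partial d_D/\partial\mathrm{E}(p_j\mid z)][\partial d_D/\partial\mathrm{E}(p_l\mid z)]\sigma_{jl}
 + O(n^{-3/2}).
\end{aligned}
\tag{3B.16}
\]

Substituting from (3B.4) into (3B.16) yields that

\[
\begin{aligned}
\operatorname{var}(r_{iD}\mid z) ={}& \Big\{2\mathrm{E}(p_i\mid z)\Big[-\mathrm{E}(p_{D\setminus i}\mid z)\sum_{j\in D\setminus i}\sigma_{ij}
 + \mathrm{E}(p_i\mid z)\sum_{j\in D\setminus i}\sum_{l\in D\setminus i,\ l>j}\sigma_{jl}\Big] \\
 &\quad + [\mathrm{E}(p_{D\setminus i}\mid z)]^2\sigma_{ii}
 + [\mathrm{E}(p_i\mid z)]^2\sum_{j\in D\setminus i}\sigma_{jj}\Big\}\Big/[\mathrm{E}(p_D\mid z)]^4 \\
 &+ O(n^{-3/2}).
\end{aligned}
\tag{3B.17}
\]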
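
A quick numerical check of the first-order variance (3B.16) is possible in the complete-data case, where the posterior is Dirichlet and the ratio $p_i/p_D$ is known to follow a beta distribution. The Python sketch below compares the two; it is an illustration under that assumption, and the function names and the posterior parameters are hypothetical.

```python
def dirichlet_moments(alpha):
    """Mean vector and covariance matrix of a Dirichlet(alpha) distribution."""
    total = sum(alpha)
    mean = [a / total for a in alpha]
    k = len(alpha)
    cov = [
        [
            (mean[j] * (1 - mean[j]) if j == l else -mean[j] * mean[l]) / (total + 1)
            for l in range(k)
        ]
        for j in range(k)
    ]
    return mean, cov


def ratio_variance_first_order(mean, cov, D, i):
    """Leading term of (3B.16): gradient times Sigma_D times gradient."""
    p_D = sum(mean[j] for j in D)
    # Gradient (3B.4) evaluated at the posterior means.
    grad = {j: ((p_D - mean[i]) if j == i else -mean[i]) / p_D ** 2 for j in D}
    return sum(grad[j] * grad[l] * cov[j][l] for j in D for l in D)


# Hypothetical posterior parameters; D = {1, 2} (positions 0 and 1), i = 1.
alpha = [30.0, 20.0, 50.0]
mean, cov = dirichlet_moments(alpha)
approx = ratio_variance_first_order(mean, cov, D=[0, 1], i=0)

# Exact variance of p_1 / (p_1 + p_2), which is Beta(alpha_1, alpha_2).
a, b = alpha[0], alpha[1]
exact = a * b / ((a + b) ** 2 * (a + b + 1))
```

For these parameters the two values agree to within about one percent, in line with the remainder in (3B.16) being of smaller order than the leading term.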
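3B.4 Posterior Covariance:

Similarly, for all cases except those for which i=h at the same
time that D=Q,

\[
\begin{aligned}
r_{iD}r_{hQ}\mid z ={}& \Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]\Big[\mathrm{E}(p_h\mid z)\Big/\sum_{l\in Q}\mathrm{E}(p_l\mid z)\Big] \\
 &+ e_D\Big\{\Big[\mathrm{E}(p_h\mid z)\Big/\sum_{l\in Q}\mathrm{E}(p_l\mid z)\Big][\partial d_D/\partial\mathrm{E}(p\mid z)]'\Big\} \\
 &+ e_Q\Big\{\Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big][\partial d_Q/\partial\mathrm{E}(p\mid z)]'\Big\} \\
 &+ [\partial d_D/\partial\mathrm{E}(p\mid z)]\,e_D'e_Q\,[\partial d_Q/\partial\mathrm{E}(p\mid z)]' \\
 &+ o_p(n^{-1/2})\Big\{\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)+\mathrm{E}(p_h\mid z)\Big/\sum_{l\in Q}\mathrm{E}(p_l\mid z) \\
 &\qquad + e_D[\partial d_D/\partial\mathrm{E}(p\mid z)]'+e_Q[\partial d_Q/\partial\mathrm{E}(p\mid z)]'\Big\} + o_p(n^{-1}).
\end{aligned}
\tag{3B.18}
\]

Therefore,

\[
\begin{aligned}
\mathrm{E}(r_{iD}r_{hQ}\mid z) ={}& \Big[\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)\Big]\Big[\mathrm{E}(p_h\mid z)\Big/\sum_{l\in Q}\mathrm{E}(p_l\mid z)\Big] \\
 &+ [\partial d_D/\partial\mathrm{E}(p\mid z)]\,\Sigma_{DQ}\,[\partial d_Q/\partial\mathrm{E}(p\mid z)]' \\
 &+ o(n^{-1/2})\Big\{\mathrm{E}(p_i\mid z)\Big/\sum_{j\in D}\mathrm{E}(p_j\mid z)+\mathrm{E}(p_h\mid z)\Big/\sum_{l\in Q}\mathrm{E}(p_l\mid z)\Big\} \\
 &+ o(n^{-1})
\end{aligned}
\tag{3B.19}
\]

for $\Sigma_{DQ}$ being the matrix whose elements are $\sigma_{jl}$ for all
$j\in D$ and all $l\in Q$. That is, $\sigma_{jl}\in\Sigma_{DQ}$ if and only
if $j\in D$ and $l\in Q$. If $k_D$ is the number of integers in D and
$k_Q$ is the number of integers in Q, then the dimension of
$\Sigma_{DQ}$ is $k_D\times k_Q$.

Recall that $D\setminus i$ denotes D minus the integer i and let
$Q\setminus h$ denote Q minus the integer h. Then, substitution from
(3B.4) into (3B.19) yields that

\[
\begin{aligned}
\mathrm{E}(r_{iD}r_{hQ}\mid z) ={}& \Big\{\mathrm{E}(p_{D\setminus i}\mid z)\Big[\mathrm{E}(p_{Q\setminus h}\mid z)\sigma_{ih}-\mathrm{E}(p_h\mid z)\sum_{l\in Q\setminus h}\sigma_{il}\Big] \\
 &\quad + \mathrm{E}(p_i\mid z)\sum_{j\in D\setminus i}\Big[\mathrm{E}(p_h\mid z)\sum_{l\in Q\setminus h}\sigma_{jl}-\mathrm{E}(p_{Q\setminus h}\mid z)\sigma_{jh}\Big]\Big\}\Big/[\mathrm{E}(p_D\mid z)\mathrm{E}(p_Q\mid z)]^2 \\
 &+ \mathrm{E}(p_i\mid z)\mathrm{E}(p_h\mid z)/[\mathrm{E}(p_D\mid z)\mathrm{E}(p_Q\mid z)] + O(n^{-1}).
\end{aligned}
\tag{3B.20}
\]

From (3B.9) and (3B.19) we have that

\[
\mathrm{E}(r_{iD}\mid z)\mathrm{E}(r_{hQ}\mid z) = \mathrm{E}(r_{iD}r_{hQ}\mid z)
 - [\partial d_D/\partial\mathrm{E}(p\mid z)]\,\Sigma_{DQ}\,[\partial d_Q/\partial\mathrm{E}(p\mid z)]' + o(n^{-1}). \tag{3B.21}
\]

Therefore, from (3B.19) and (3B.21),

\[
\begin{aligned}
\operatorname{cov}(r_{iD},r_{hQ}\mid z) &= [\partial d_D/\partial\mathrm{E}(p\mid z)]\,\Sigma_{DQ}\,[\partial d_Q/\partial\mathrm{E}(p\mid z)]' + o(n^{-1}) \\
 &= [\partial d_D/\partial\mathrm{E}(p\mid z)]\,\Sigma_{DQ}\,[\partial d_Q/\partial\mathrm{E}(p\mid z)]' + O(n^{-3/2}) \\
 &= \sum_{j\in D}\sum_{l\in Q}[\partial d_D/\partial\mathrm{E}(p_j\mid z)][\partial d_Q/\partial\mathrm{E}(p_l\mid z)]\sigma_{jl} + O(n^{-3/2}).
\end{aligned}
\tag{3B.22}
\]

Substituting from (3B.4) yields that

\[
\begin{aligned}
\operatorname{cov}(r_{iD},r_{hQ}\mid z) ={}& \Big\{\mathrm{E}(p_{D\setminus i}\mid z)\Big[\mathrm{E}(p_{Q\setminus h}\mid z)\sigma_{ih}-\mathrm{E}(p_h\mid z)\sum_{l\in Q\setminus h}\sigma_{il}\Big] \\
 &\quad + \mathrm{E}(p_i\mid z)\sum_{j\in D\setminus i}\Big[\mathrm{E}(p_h\mid z)\sum_{l\in Q\setminus h}\sigma_{jl}-\mathrm{E}(p_{Q\setminus h}\mid z)\sigma_{jh}\Big]\Big\} \\
 &\Big/[\mathrm{E}(p_D\mid z)\mathrm{E}(p_Q\mid z)]^2 + O(n^{-3/2}).
\end{aligned}
\tag{3B.23}
\]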
(3B.23) 
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APPENDIX 3C 

INTERMEDIATE CALCULATION FOR VARIANCE 


From (3.14) 


var(pjz) = E (z D ^var(r. n |z)+ E z D z n cov(r. D ,r, 0 (z)]/[m(m+l)] 
D^i u pjz ~ Qal v p | z v ~ 

Q^D 


(3C.1) 


+{ E (z D /m) E [r. n (l-r. n |z)]+E(p i |z)[l-E(p. |z)]}/(m+l). 
D^i u pjz 1U 1U ~ 1 ~ 


Substituting from (3B.17), (3B.13), and (3B.23), we have, for 0 
denoting D minus the integer i and 0 denoting Q minus the integer i, 
that 


var(p.|z) = E (z D /m)[z n /(m+l) ]/[E(pJz) ] 4 {2E(p, |z)[E(p, |z) E E a.„ 
1 - D3i U U U ~ 1 ~ 1 ~ je0 m J 



j£0 JJ 

(3C.2) 


+ E E (z n /m)[z 0 /(m+l)]/[E(p D |z)E(p Q |z)] 2 
D^i Qai y w ~ 

Q^D 


x{E( P p|z)[E(p 0 |z)d.. - E(p.|z) E a iJt ] 


+E(p.|z) E [E(p- i z) E a u - E(p 0 |z)a..]} 
1 ~ j£D 1 ~ w ~ 

+ |^E (z D /m) (e(p 1 |z)E(p 0 |z)/[E(p D |z) ] 2 


-{2E(p.|z)E(p 0 |z) E a.. + 2[E(p.|z)] 2 E E 5. 
1 ~ v ~ je0 1J 1 j£0 ££0 

*>j 


+ [E(p 0 |z)] a.. • 
/(m+1) + $\delta_{ii}$

for $\delta_{ii}$ denoting the error in (3C.2) made by approximating posterior
moments of the ratios $r_{iD}$ in (3C.1).

Therefore, 


+ E(p.|z)[l-E(p i |z)] 


3,, - l (z D /m)[(z D -l)/(mH)]/[E(p 0 |z)] 4 {[E(pp|z)] 2 3 11 

+E(p.|z) Z [E(p i |z)o..-2E(p 0 |z)d..+2E(p 1 |z) Z d.J} 
1 ~ j60 1 ~ JJ v - J* ~ A60 

*>j 


^ Z^(z D /m)[ZQ/(m+l)]/[E(p D |z)E(pq|z)]^{E(pp|z)[E(p 0 |z)a ii 

Q*D 


(3C.3) 


-E(p.|z) Z a.J+E(p.|z) Z [E(p.|z) E o . A -E(p 0 1 z)a . i ] } 
1 ~ JL€Q 1Jl 1 ~ j € 0 1 ~ ££Q 31 w 

+{ Z (z D /m)E(p i |z)E(pp|z)/£E(p D | z) ] 2 


+ E(p i |z)[l-E(p 1 |z)]}/(m+l) + 6 1i . 



CHAPTER 4 


ASYMPTOTICS FOR TAYLOR-SERIES APPROXIMATIONS 
4.1 Introduction:

In Chapter 3, we used low-order Taylor-series expansions for un- 
known terms in deriving Taylor-series approximations for the posterior 
mean and covariance matrices. For these Taylor-series expansions to 
allow accurate approximations, higher-order central cross-product 
posterior moments of p must be substantially smaller than lower-order 
central cross-product moments. In this chapter we prove this condi- 
tion. We then assess the accuracy of the Taylor-series approximations. 
Because results are in terms of orders of magnitude or otherwise involve 
limiting distributions, we call this chapter the asymptotics for Taylor-
series approximations. For the asymptotics we use the sampling-theory
approach. We fix the probability p and then study the limiting distri- 
bution of the data as the sample size n goes to infinity. 

In the next section we determine the magnitude of the central 
cross-product moments and show that this magnitude substantially de- 
creases as the order of the moment increases. The first part of the 
section gives results for complete data; the last part, results for 
incomplete data. In the third section we assess the accuracy of the 
Taylor-series approximations for the posterior mean and covariance 
matrices. We begin by giving the accuracy for the ratio approximations 
of Appendix 3B. A summary concludes the chapter. 

Five appendices give derivations used in the chapter. The first 
appendix calculates the posterior central moments given complete
multinomial data. The second appendix derives the limiting posterior
distribution given complete data. The third appendix calculates cen- 
tral moments of the k-dimensional multivariate normal distribution, 
giving results more general than found in the literature. The fourth 
appendix derives the limiting posterior distribution given incomplete 
multinomial data. Finally, the fifth appendix gives the error in 
evaluating a function by an iterative solution of an approximation to 
the function. Note that techniques developed in the appendices are 
applicable to distributions other than the Dirichlet or multinomial. 





4.2 Central Cross-Product Moments : 
4.2.1 Complete Data: 


In this section we obtain the order of magnitude of central cross- 
product moments given complete multinomial data x. We begin by obtaining 
the order of magnitude of the $l$-th posterior central moment (2.6). To do
so, in Appendix 4A we write (2.6) in a Taylor series in $(n + \sum_j \nu_j)^{-1}$ about 0
for enough values of $l$ to detect a pattern for the low-order term in $n^{-1}$.
We then extend moment results from Kendall and Stuart (1969, v1, p148) for
Pearson distributions to prove by induction that for

$$l!! = l(l-2)(l-4)(l-6) \cdots 1 \quad \text{for } l \text{ odd}^{*},$$

$$\mu_i = E(p_i \mid x),$$

and

$$\sigma_{ii} = \operatorname{var}(p_i \mid x) = \mu_i(1 - \mu_i)\Big/\Big(n + \sum_{j=1}^{k+1} \nu_j\Big), \tag{4.1}$$

we have that

$$
E[(p_i - \mu_i)^l \mid x] \approx
\begin{cases}
(l-1)!!\,\sigma_{ii}^{l/2} & \text{for } l \text{ even} \\
(l-1)\,l!!\,\sigma_{ii}^{(l+1)/2}\,(1 - 2\mu_i)/[3\mu_i(1 - \mu_i)] & \text{for } l \text{ odd,}
\end{cases}
\tag{4.2}
$$

where the approximation in (4.2) means that we have given the lowest-order
term in $n^{-1}$. [Recall from (2.6) that $E[(p_i - \mu_i)^l \mid x]$ is a function of n.]
Hence, noting the n in the denominator of $\sigma_{ii}$ in (4.1), we have that

$$
\begin{aligned}
\text{for } l \text{ even,} \quad & \lim_{n \to \infty} n^{l/2}\, E[(p_i - \mu_i)^l \mid x] = (l-1)!!\,[\mu_i(1 - \mu_i)]^{l/2} \\
\text{and for } l \text{ odd,} \quad & \lim_{n \to \infty} n^{(l+1)/2}\, E[(p_i - \mu_i)^l \mid x] = (l-1)\,l!!\,(1 - 2\mu_i)[\mu_i(1 - \mu_i)]^{(l-1)/2}/3.
\end{aligned}
\tag{4.3}
$$

*Standard mathematical notation; for example, see Gradshteyn and Ryzhik
(1967, p xliii); $l!!$ is not defined for $l$ even.

Therefore,

$$
E[(p_i - \mu_i)^l \mid x] =
\begin{cases}
O(n^{-l/2}) & \text{for } l \text{ even} \\
O(n^{-(l+1)/2}) & \text{for } l \text{ odd.}
\end{cases}
\tag{4.4}
$$
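
The limits in (4.3) can be checked numerically for a single component. The following is a minimal sketch, assuming only that the marginal posterior of one component of a Dirichlet is a Beta distribution; the function names and the arguments `a`, `b`, and `total` (standing in for the sum of the posterior parameters, which grows like n) are illustrative and do not appear in the text.

```python
from fractions import Fraction
from math import comb


def beta_central_moment(a, b, order):
    """Exact central moment of order `order` for a Beta(a, b) variable."""
    a, b = Fraction(a), Fraction(b)
    total = a + b
    mu = a / total
    # Raw moments: E[p^r] = prod_{j<r} (a + j) / (a + b + j).
    raw = [Fraction(1)]
    for j in range(order):
        raw.append(raw[-1] * (a + j) / (total + j))
    # Binomial expansion of E[(p - mu)^order].
    return sum(comb(order, r) * (-mu) ** (order - r) * raw[r]
               for r in range(order + 1))


def double_factorial(m):
    """m!! = m (m-2) (m-4) ... down to 1 or 2."""
    out = 1
    while m > 1:
        out *= m
        m -= 2
    return out


def scaled_moment(mu, total, order):
    """Central moment times total^(l/2) (l even) or total^((l+1)/2) (l odd)."""
    a = Fraction(mu) * total
    b = total - a
    power = order // 2 if order % 2 == 0 else (order + 1) // 2
    return float(beta_central_moment(a, b, order) * Fraction(total) ** power)


def limit_value(mu, order):
    """Right-hand sides of the two limits in (4.3)."""
    mu = float(mu)
    var = mu * (1.0 - mu)
    if order % 2 == 0:
        return double_factorial(order - 1) * var ** (order // 2)
    return ((order - 1) * double_factorial(order) * (1.0 - 2.0 * mu)
            * var ** ((order - 1) // 2) / 3.0)
```

Calling `scaled_moment(Fraction(3, 10), 10**6, l)` for small `l` and comparing with `limit_value(Fraction(3, 10), l)` shows the even and odd moments settling at the stated limits.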


In Appendix 4A we also found that

$$
E[(p_i - \mu_i)^l (p_j - \mu_j)^h \mid x] =
\begin{cases}
O(n^{-(l+h)/2}) & \text{for } l+h \text{ even} \\
O(n^{-(l+h+1)/2}) & \text{for } l+h \text{ odd}
\end{cases}
$$

for $2 \le l, h \le 8$.

However, the methods of Appendix 4A were unfeasible for evaluating
the general $l$-th posterior central cross-product moment
$E[\prod_{g=1}^{l} (p_{h_g} - \mu_{h_g}) \mid x]$
for $1 \le h_g \le k$ with the indices $h_g$ not all equal. Therefore, to obtain general
results similar to those given in (4.4), we use the Helly-Bray Theorem
[Rao (1968, p97)]:

Theorem 4.1 (Helly-Bray Theorem): If the distribution function $F_n$ converges
to the distribution function $F$, then

$$\int g \, dF_n \to \int g \, dF$$

for every bounded continuous function $g$.


Since $\prod_{g=1}^{l} (p_{h_g} - \mu_{h_g})$ is bounded and is continuous in p, by Theorem 4.1
limits of posterior central cross-product moments equal corresponding
moments of the limiting distribution. The latter moments are usually
referred to as the asymptotic moments. [See Bishop, Fienberg, and
Holland (1975, p485).] Hence, we calculate the limiting distribution of
the posterior distribution of p given complete data x and then calculate
the central cross-product moments of this limiting distribution.

By using Stirling's approximation [Cramer (1951, p130)] for the log-
arithm of the gamma function, theorems from Graybill (1969, p8, 170, 184)
to calculate the determinant and inverse of the covariance matrix, series
approximations [CRC Tables (1962, p373)] for $\log(1+\epsilon)$ for $|\epsilon| < 1$, and
Tchebychev's inequality [Bishop, Fienberg, and Holland (1975, p476)] to
determine the magnitude of error in approximations, we prove in Appendix
4B that the k-dimensional Dirichlet density with mean u and covariance 
matrix E differs from a k-dimensional multivariate normal density with 
mean y and covariance matrix E by order of magnitude 0 p (n ' ). [Recall 
definition 3B.3 of 0 p .] Rao (1968, pl04) gives the following convergence 
theorem involving densities: 

Theorem 4.2: If the density $f_n(x)$ converges to the density $f(x)$ as $n \to \infty$,
then the distribution function $F_n(x)$ converges to the distribution func-
tion $F(x)$ as $n \to \infty$.

Therefore, from Theorem 4.2 the limiting posterior distribution of p
given complete data x is $N_k(\mu, \Sigma)$.

To obtain central cross-product moments of this limiting distribution,
in Appendix 4C we multiply the multivariate-normal moment-generating func-
tion [Wilks (1963, p168)] by $\exp(-t\mu')$, repeatedly differentiate the
results with respect to t, and then set t to 0 in the differentiated
results. Doing so yields that the $l$-th central cross-product moment for the
multivariate normal distribution is zero for $l$ odd and, for $l$ even, is
a sum of $l-1$ terms, each of which is a product of $l/2$ elements of the
covariance matrix and thus, from (2.3) and (2.4), is of magnitude $O(n^{-l/2})$.

Therefore, application of these results with the Helly-Bray Theorem
yields for the $l$-th posterior central cross-product moment that

$$\text{for } l \text{ even,} \quad E\Big[\prod_{g=1}^{l} (p_{h_g} - \mu_{h_g}) \,\Big|\, x\Big] = O(n^{-l/2}) \tag{4.5}$$

for $1 \le h_g \le k$. For $l$ odd, however, these results yield only that

$$\text{for } l \text{ odd,} \quad \lim_{n \to \infty} E\Big[\prod_{g=1}^{l} (p_{h_g} - \mu_{h_g}) \,\Big|\, x\Big] = 0. \tag{4.6}$$

Therefore, to calculate the order of magnitude for odd posterior central
cross-product moments, we have the following lemma:

Lemma 4.1: For $l$ a positive integer,

$$\Big|E\Big[\prod_{g=1}^{l} (p_{h_g} - \mu_{h_g}) \,\Big|\, x\Big]\Big| \le \Big|E\Big[\prod_{g=1}^{l-1} (p_{h_g} - \mu_{h_g}) \,\Big|\, x\Big]\Big|. \tag{4.7}$$

Proof:

First note that, since $h_a$ can equal $h_b$ for any $1 \le a, b \le l$ among the $l$
values of $g$, the density function $f_l$ for the $l$-th central cross-product
moment will be of dimension $\alpha$, $1 \le \alpha \le k$.

In going from the $(l-1)$-st to the $l$-th central cross-product moment,
the density function will remain the same if the additional variable $p_{h_l}$
for $p_{h_l} - \mu_{h_l}$ is a variable of $f_{l-1}$. In such case, the proof follows from
the fact that, for all $g$,

$$-1 \le p_{h_g} - \mu_{h_g} \le 1. \tag{4.8}$$

Hence, the integrand for the $l$-th posterior central cross-product moment
is a fraction of that for the $(l-1)$-st central cross-product moment and
(4.7) therefore follows.

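If the additional variable $p_{h_l}$ in going from the $(l-1)$-st to the $l$-th
central cross-product moment is not a variable of $f_{l-1}$, then for $P_\alpha$ the
$\alpha$-dimensional simplex of the vector of those distinct probabilities $p_{h_g}$
in $\prod_{g=1}^{l} p_{h_g}$ and $P_{\alpha-1}$ the $(\alpha-1)$-dimensional subspace of $P_\alpha$ obtained by
deleting variable $p_{h_l}$, we have that

$$
\begin{aligned}
\Big|E\Big[\prod_{g=1}^{l} (p_{h_g} - \mu_{h_g}) \,\Big|\, x\Big]\Big|
&= \Big|\int_{P_\alpha} \prod_{g=1}^{l} (p_{h_g} - \mu_{h_g})\, f_l \, dp\Big| \\
&= \Big|\int_{P_{\alpha-1}} \prod_{g=1}^{l-1} (p_{h_g} - \mu_{h_g}) \Big[\int (p_{h_l} - \mu_{h_l})\, f_l \, dp_{h_l}\Big] dp\Big| \\
&\le \Big|\int_{P_{\alpha-1}} \prod_{g=1}^{l-1} (p_{h_g} - \mu_{h_g})\, f_{l-1} \, dp\Big|
= \Big|E\Big[\prod_{g=1}^{l-1} (p_{h_g} - \mu_{h_g}) \,\Big|\, x\Big]\Big|
\end{aligned}
\tag{4.9}
$$

since (4.8) yields that $\int (p_{h_l} - \mu_{h_l})\, f_l \, dp_{h_l}$ is bounded by $\pm \int f_l \, dp_{h_l} = \pm f_{l-1}$.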


From bound (4.7), magnitude (4.5), low-order terms for cross-product
moments $|E[(p_i - \mu_i)^a (p_j - \mu_j)^b \mid x]|$ for $2 \le a, b \le 8$ from Appendix 4A, and results
(4.4) for $E[(p_i - \mu_i)^l \mid x]$ for $l$ odd, we would expect that, in general,

$$E\Big[\prod_{g=1}^{l} (p_{h_g} - \mu_{h_g}) \,\Big|\, x\Big] = O(n^{-(l+1)/2}) \quad \text{for } l \text{ odd.}$$

Note that for incomplete data, we can duplicate all complete-data
results but one. We cannot parallel the proof from Appendix 4A that for odd
$l$ of 3, 5, and 7 the cross-product moment is $O(n^{-(l+1)/2})$. Although we
expect this result based on all complete-data results and on incomplete-
density as a product of complete-data Dirichlet densities, each having,
from Appendix 4B, a limiting multivariate normal distribution. Because 
these densities are of differing dimensions and on differing combinations 
of variables, we do not immediately have that the resultant product of 
these multivariate normal densities is a k-dimensional multivariate normal 
density on the k components of p. However, by equating coefficients and 
solving for unknowns, we then prove that, owing to the special relation- 
ship between the first and each remaining product, the sum of exponents 
from each Dirichlet in the product does form the exponent of such a density.
Following derivations in Appendix 4D, we have as final results for elements
$\mu_i$, $S^{ii}$, and $S^{ij}$, respectively, of the asymptotic mean and inverse covar-
iance matrices

$$\mu_i = \Big(z_i + \sum_{D \ni i} z_D\, \mu_i/\mu_D\Big)\Big/n, \tag{4.10}$$

$$
\begin{aligned}
S^{ii} ={}& n(\mu_i + \mu_{k+1})/(\mu_i \mu_{k+1})
- \sum_{\substack{D \not\ni i \\ D \ni k+1}} (z_D/\mu_D)(\mu_D - \mu_{k+1})/(\mu_D \mu_{k+1}) \\
&- \sum_{\substack{D \ni i \\ D \not\ni k+1}} (z_D/\mu_D)(\mu_D - \mu_i)/(\mu_i \mu_D)
- \sum_{\substack{D \ni i \\ D \ni k+1}} (z_D/\mu_D)(\mu_i + \mu_{k+1})/(\mu_i \mu_{k+1}),
\end{aligned}
\tag{4.11}
$$

and

$$
S^{ij} = n/\mu_{k+1}
+ \sum_{\substack{D \not\ni k+1 \\ D \ni i,j}} (z_D/\mu_D)/\mu_D
+ \sum_{\substack{D \ni k+1 \\ D \not\ni i,\ D \not\ni j}} (z_D/\mu_D)/\mu_D
- \sum_{D \ni k+1} (z_D/\mu_D)/\mu_{k+1},
\tag{4.12}
$$

for D a set containing more than one element, "$D \ni i,j$" meaning the set
D containing both i and j, and all conditions given under a summation sign
to be met simultaneously [for example, the first summation sign in (4.12)
means the sum over all sets D such that $D \not\ni k+1$ at the same time that $D \ni i,j$].
Note that expressions (4.11) and (4.12) for elements of the asymptotic
inverse covariance matrix are simple [especially relative to expressions 
(4D.12) and (4D.13) given by the traditional derivation]. Furthermore, 
they parallel complete-data results given by the first term in each of 
expressions (4.11) and (4.12). [Note that once we have expressions (4.11) 
and (4.12) and thus know what to work toward, we show in lengthy reexpressions 
in Appendix 4D that results given by the traditional approach can be 
simplified to (4.11) and (4.12). Thus, the second approach might be use- 
ful in other kinds of problems to clarify and simplify any unwieldy results 
given by the traditional approach.] 

From (4.11) and (4.12) we have that elements of the asymptotic covar-
iance matrix are $O(n^{-1})$. Thus, paralleling (4.5) we have that

$$\text{for } l \text{ even,} \quad E\Big[\prod_{g=1}^{l} (p_{h_g} - \mu_{h_g}) \,\Big|\, z\Big] = O(n^{-l/2}). \tag{4.13}$$

Further, Lemma 4.1 holds for the case of incomplete data z as well as for
that of complete data x. Therefore, again paralleling the case for complete
data, since Lemma 4.1 gives that the odd $l$-th posterior central cross-product
moment is bounded in magnitude by the even $(l-1)$-st moment, from (4.13) the
odd $l$-th moment is of magnitude no greater than $O(n^{-(l-1)/2})$. Therefore, con-
ditions for using Taylor-series expansions in Chapter 3 are satisfied.

Note, from comparing (3.9) with (4.10), that asymptotically the Taylor-
series approximate posterior mean equals the exact posterior mean.

4.3 Accuracy of the Taylor-Series Approximations:

4.3.1 Introduction:

To determine the accuracy of the Taylor-series approximations of 
Chapter 3, we note that the only terms we approximated in the deriva- 
tions were moments of the ratios. Therefore, we calculate the error 
made in these approximations and then calculate the overall error made 
by substituting these approximations into equations (3.6), (3.14), and 
(3.15) for the posterior mean, variance, and covariance, respectively. 
We also apply results from Isaacson and Keller (1966) to determine the 
error made by iteratively solving the resulting equations and then 
using the solution to approximate the exact posterior central moments. 

4.3.2 Accuracy of the Taylor-Series Approximation for Posterior Mean:

The approximation for the exact posterior mean

$$E(p_i \mid z) = \Big(z_i + \nu_i + \sum_{D \ni i} z_D\, E(p_i \mid z)/E(p_D \mid z)\Big)\Big/m + e_i \tag{4.14}$$

obtained by dropping the error term is

$$E(p_i \mid z) \approx \Big(z_i + \nu_i + \sum_{D \ni i} z_D\, E(p_i \mid z)/E(p_D \mid z)\Big)\Big/m. \tag{4.15}$$

Rewriting (4.15) as a nonlinear system of equations yields the Taylor-
series approximate posterior mean

$$\dot{p}_i = \Big(z_i + \nu_i + \sum_{D \ni i} z_D\, \dot{p}_i/\dot{p}_D\Big)\Big/m \tag{4.16}$$

given in (3.9).

We now give the asymptotic error in using (4.16) to approximate
(4.14). We do so by first determining the error in approximating
(4.14) by (4.15) and then determining the error in solving (4.15) by
the EM iterative algorithm of Dempster, Laird, and Rubin (1977). We
look for conditions under which an iterative solution to (4.15) agrees
with (4.14) within some error bound. In the formulation of the itera-
tive process, we rewrite (4.15) as (4.16).
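
The iterative process can be sketched as a plain fixed-point iteration on the right-hand side of (4.16). This is a minimal illustration, not the implementation used in the study: the function name, the dictionary `partial` (mapping each set D of category indices to its partially classified count), the uniform starting value, and the stopping rule are all assumptions made here.

```python
def approximate_posterior_mean(z, nu, partial, start=None,
                               tol=1e-12, max_iter=10_000):
    """Iterate p_i <- (z_i + nu_i + sum_{D containing i} z_D p_i / p_D) / m."""
    size = len(z)
    n = sum(z) + sum(partial.values())   # fully plus partially classified counts
    m = n + sum(nu)
    p = list(start) if start is not None else [1.0 / size] * size
    for _ in range(max_iter):
        new_p = []
        for i in range(size):
            shared = sum(z_D * p[i] / sum(p[j] for j in D)
                         for D, z_D in partial.items() if i in D)
            new_p.append((z[i] + nu[i] + shared) / m)
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    raise RuntimeError("iteration did not converge")
```

For example, `approximate_posterior_mean([10, 5, 5], [1, 1, 1], {(0, 1): 5})` treats a trinomial sample with five observations known only to fall in the first or second category. Starting values with zero components should be avoided, since the corresponding ratio terms then contribute nothing at every iteration.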

To determine the error in approximating (4.14) by (4.15), we must
determine the accuracy of the approximation of each ratio $r_{iD}$ and its
first two moments. These accuracies are given in Appendix 3B in terms
of the O and o notations. Then, from (3.6) and (3.7), the exact
posterior mean can be written as

$$
\begin{aligned}
E(p_i \mid z) &= \Big[z_i + \nu_i + \sum_{D \ni i} z_D\, E(r_{iD} \mid z)\Big]\Big/m \\
&= \Big\{z_i + \nu_i + \sum_{D \ni i} z_D\, [E(p_i \mid z)/E(p_D \mid z) + O(n^{-1})]\Big\}\Big/m \\
&= \Big[z_i + \nu_i + \sum_{D \ni i} z_D\, E(p_i \mid z)/E(p_D \mid z)\Big]\Big/m + O(n^{-1}).
\end{aligned}
\tag{4.17}
$$

Hence, the error in approximating (4.14) by (4.15) is $O(n^{-1})$.

We next investigate how this error is affected by solving (4.15)
by the EM iterative algorithm. To do so, we find two conditions in
Appendix 4E whose satisfaction guarantees that

$$|\dot{p}_i^{(s)} - E(p_i \mid z)| \le \delta/(1-\lambda) + \lambda^s\,[\rho_0 - \delta/(1-\lambda)]. \tag{4.18}$$

In (4.18), $\delta$ is a bound on the error made by approximating (4.14) by
(4.15) and, hence, from (4.17) is of magnitude $O(n^{-1})$. The term $\lambda$ is a
positive proportion less than 1; $\lambda$ differs from a constant by $O(n^{-1})$.
Therefore, $\delta/(1-\lambda) = O(n^{-1})$. The term $\rho_0$ is a constant. Since s can be
made as large as desired, the right-hand term can be considered to be 0;
in particular, it can be made at least as small as $O(n^{-1})$. Thus, from
(4.18), when the two conditions given in Appendix 4E are satisfied, the
error in approximating (4.14) by (4.16) is $O(n^{-1})$; i.e.,

$$\dot{p}_i = E(p_i \mid z) + O(n^{-1}). \tag{4.19}$$


The two conditions in Appendix 4E concern the region in which the
initial iterative estimate is chosen and a bound on the partial deriva-
tives of the right-hand side of (4.16) with respect to $\dot{p}_j$. If there
exists a neighborhood $\|\dot{p} - E(p \mid z)\|_\infty < \rho$, for $\rho > 0$, of the exact
posterior mean such that for all probabilities $\dot{p}$ in this neighborhood

$$\max_{1 \le i \le k} \sum_{j=1}^{k} |\partial g_i(\dot{p})/\partial \dot{p}_j| \le \lambda < 1,$$

for

$$g_i(\dot{p}) = \Big(z_i + \nu_i + \sum_{D \ni i} z_D\, \dot{p}_i/\dot{p}_D\Big)\Big/\Big(n + \sum_{h=1}^{k+1} \nu_h\Big),$$

and if an initial iterative estimate $\dot{p}^{(0)}$ is chosen within the inner
neighborhood $\|\dot{p} - E(p \mid z)\|_\infty < \rho_0 = \rho - \delta/(1-\lambda)$, where $\delta$ is a bound on the error in
approximating the exact posterior mean by a first-order Taylor-series
expansion, then the iterative solution to the defining equation of the
Taylor-series approximate posterior mean $\dot{p}$ will converge to within $\delta/(1-\lambda)
= O(n^{-1})$ of the exact posterior mean.

If a neighborhood of the exact posterior mean can be found in which
the second condition is satisfied, then, for large enough sample sizes,
the first condition can be satisfied by choosing an initial iterative
estimate in a neighborhood within the first neighborhood. For moderate
percentages of incomplete data, the inner neighborhood is almost as
large as the outer neighborhood. In Appendix 4E, we show how to deter-
mine, in practice, whether the second condition can be expected to
hold; hence, we show how to approximate the size of the outer neighbor-
hood. Further, for incomplete trinomial data, Appendix 4E shows that
a root of the defining equations of the Taylor-series approximate
posterior mean that differs from the exact posterior mean by magnitude
$O(n^{-1})$ exists in $P_2$.

However, this root need not be unique in $P_2$. Moreover, as Ortega
and Rheinboldt (1970, p2) illustrate with a simple case, a nonlinear
system of k equations in k unknowns may have no solution or may have
arbitrarily many solutions. Therefore, we now consider when the
Taylor-series approximate posterior mean for incomplete data from the
general k-dimensional multinomial distribution not only has a solution
but also has a solution that is in $P_k$ and that differs from the exact
posterior mean by magnitude $O(n^{-1})$.

Because the Taylor-series approximate posterior mean can be written
as a posterior mode, it will always have at least one solution in $P_k$ when
certain conditions, soon to be discussed, are met. However, none of
these solutions may be in the epm convergence region, which we define as
the region in which an initial iterative estimate can be picked so that
successive iterates are guaranteed to converge to within a small error
of the exact posterior mean. In particular, for k>2, there may not exist
an epm convergence region. That is, there may not exist a neighborhood
of the exact posterior mean such that for all probabilities $\dot{p}$ in the neighborhood

$$\max_i \sum_{j=1}^{k} |\partial g_i(\dot{p})/\partial \dot{p}_j| = \max_i \sum_{D \ni i} (z_D/m)\{1/\dot{p}_D + [\beta(D) - 2]\,\dot{p}_i/\dot{p}_D^2\} < 1. \tag{4.20}$$

As the number k of dimensions, the percentage $100 \times \sum_D z_D/n$ of incomplete
data, or the number $\beta(D)$ of variables sharing incomplete data increases,
inequality (4.20) shows that this possibility increases.

The most likely values of p not to have an epm convergence region
are those in higher dimensions that have one or more components near
zero and/or a component near 1 when the percentage of incomplete data is
high. For example, consider incomplete multinomial data $z_1, z_2, \ldots,
z_{10}, z_{11}$, and $z_{1 \cdots 10}$ where the percentage of incomplete data is
$100 \times (z_{1 \cdots 10}/n) = 50$. Suppose that $p_{10} = .89$ and $p_i = .01$ otherwise. Further,
suppose that the sample size n is large enough, or the sum $\sum \nu_i$ of prior
parameters is small enough, that $z_{1 \cdots 10}/m \approx z_{1 \cdots 10}/n = 0.5$. Then, for
probabilities $\dot{p}^{(s)}$ near p and $D = \{1, \ldots, 10\}$, one term in (4.20) is

$$
(z_{1 \cdots 10}/m)\{1/\dot{p}_D^{(s)} + [\beta(D) - 2]\,\dot{p}_{10}^{(s)}/[\dot{p}_D^{(s)}]^2\}
= 0.5\{1/.99 + 8 \times .89/(.99)^2\} = 4.14 > 1.
$$

However, for probabilities having such small values for some components,
results of Chapters 6 and 7 indicate that the posterior mean is a
relatively poor estimator to minimize risk for quadratic loss; the
posterior mode is much better. Hence, for this particular case, we
do not have to be concerned with not being able to find an epm conver-
gence region. This example illustrates, however, that the Taylor-
series approximate posterior mean needs more study in higher dimensions,
since the largest factor, 8, in the last inequality is a function of
the dimension of the simplex.
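
The row-sum criterion in (4.20) is easy to evaluate for a candidate probability vector. The sketch below is illustrative only: the function names and the `partial` dictionary are assumptions, every category in D is treated as a free coordinate, and the probabilities in the usage example are chosen so that the sum over D is .99, matching the arithmetic displayed above rather than any data set from the study.

```python
def row_sum_closed_form(i, p, partial, m):
    """Sum over D containing i of (z_D/m) {1/p_D + [size(D) - 2] p_i / p_D^2}."""
    total = 0.0
    for D, z_D in partial.items():
        if i in D:
            p_D = sum(p[j] for j in D)
            total += (z_D / m) * (1.0 / p_D + (len(D) - 2) * p[i] / p_D ** 2)
    return total


def row_sum_direct(i, p, partial, m):
    """Same quantity, summing |d g_i / d p_j| over j term by term."""
    grads = [0.0] * len(p)
    for D, z_D in partial.items():
        if i in D:
            p_D = sum(p[j] for j in D)
            for j in D:
                if j == i:
                    grads[j] += z_D * (p_D - p[i]) / (m * p_D ** 2)
                else:
                    grads[j] -= z_D * p[i] / (m * p_D ** 2)
    return sum(abs(g) for g in grads)


# Hypothetical inputs reproducing the displayed arithmetic: ten categories
# share the partially classified count, one of them has probability .89.
p_example = [0.10 / 9] * 9 + [0.89, 0.01]
partial_example = {tuple(range(10)): 50}
bound = row_sum_closed_form(9, p_example, partial_example, m=100)
```

A value of `bound` above 1 signals that the contraction condition fails at that point, as in the example in the text.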

When there does exist an epm convergence region, there can be a
problem finding it because there may be multiple roots in $P_k$ of the
Taylor-series approximation. In particular, there may be multiple roots
in $P_k$ for which inequality (4.20) is satisfied. The problem then is
choosing among these roots. Since the Taylor-series approximation can
be written as

$$\dot{p}_i = \Big[z_i + \beta_i - 1 + \sum_{D \ni i} z_D\, \dot{p}_i/\dot{p}_D\Big]\Big/\Big[n + \sum_{j=1}^{k+1} \beta_j - (k+1)\Big], \tag{4.21}$$

where $\beta_i = \nu_i + 1$, the Taylor-series approximation is a posterior mode; i.e.,
(4.21) is in form (2.43). Thus, the Taylor-series approximate posterior

mean enjoys the convergence properties of the EM algorithm. That is, 

define t . (x)=z.+v.+ E z n ^, 4>^=ln(p./p. , . ) , and t as the number of 
1 ~ 1 1 D3i u 1 1 K 1 

iterations required to meet convergence conditions. Then, since the 
multinomial distribution is a member of the regular exponential family, 
$\dot{p}^{(s)}$ converges in $P_k$ to at least a local maximum if the eigenvalues of
$\operatorname{cov}[t(x) \mid \phi^{(s)}]$, $1 \le s \le t$, are bounded above zero. [See Section 2.3.] To
find a global maximum, choose that root that maximizes the likelihood 
function 


• z^+v^-l* Z2+V2-I 
P[ P 2 



k+rvr 1 
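
Choosing among several roots by the value of the function being maximized can be sketched as follows. This is a hedged illustration: it assumes the function is the Dirichlet-multinomial kernel with exponents $z_i + \beta_i - 1$ on the individual probabilities and with each partially classified count $z_D$ entering as an exponent on the sum of the probabilities in D. The function names and the `partial` dictionary are hypothetical.

```python
import math


def log_posterior_kernel(p, z, beta, partial):
    """Log of prod_i p_i^(z_i + beta_i - 1) times prod_D p_D^(z_D)."""
    value = sum((z_i + b_i - 1) * math.log(p_i)
                for p_i, z_i, b_i in zip(p, z, beta))
    value += sum(z_D * math.log(sum(p[j] for j in D))
                 for D, z_D in partial.items())
    return value


def best_root(roots, z, beta, partial):
    """Return the candidate root with the largest kernel value."""
    return max(roots, key=lambda p: log_posterior_kernel(p, z, beta, partial))
```

Working on the log scale avoids underflow when the counts are large; candidate roots with zero components would need separate handling.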



From the complete-data relationship between the posterior mode and
posterior mean, we intuitively expect the global maximum to be in the
epm convergence region, or at least be the closest root to the exact
posterior mean. However, this hypothesis has not been proved and needs
study. As for the two-dimensional case, however, Appendix 4E proves that
if a root is in the epm convergence region, then the error in using it
to approximate the exact posterior mean is of magnitude $O(n^{-1})$. Note
again that the $O(n^{-1})$ error mainly comes from using a first-order
Taylor-series expansion to approximate the exact posterior mean.

Observe, as we illustrate with examples in Appendix 4E, that the 
two guaranteed-convergence conditions are sufficient, not necessary. 

That is, an initial iterative estimate can fall far outside the epm 
convergence region and convergence to the exact posterior mean still 
occur to within the same small error incurred when an initial iterative 
estimate is chosen inside the epm convergence region. Moreover, as the
examples also show, the error bound given by Theorem 4E.1 when these two
conditions are satisfied is extremely conservative.

Finally, one should not pick as an initial iterative estimate a
probability containing zero components because the iterates $\dot{p}_i^{(s)}$ corresponding to those
components will be the same for all iterations. Further, any initial
iterative estimate that has components near zero may cause the conver- 
gence process to be extremely slow for those components; see Section 
5.8.4 for an example. 

4.3.3 Accuracy of Taylor-Series Approximations for Posterior Covariances : 

For those categories i and h that have only complete data, there
is no error in writing $\dot{\sigma}_{ii}$ of (3.18) and $\dot{\sigma}_{ih}$ of (3.19) for $\sigma_{ii}$ and $\sigma_{ih}$,
respectively. For those categories h having only complete data and
those categories i having incomplete data, we have from equation (3.15),
Section 4.3.2, and Lemma 3B.1 that the error in writing $\dot{\sigma}_{ih}$ of (3.20)

O 

for is 0 ( n ). For both categories i and h having incomplete data, 
we have a choice of approximating the variance and covariance by pro- 
cedures that are iterative or noniterative in a.. . 

i h 

For the iterative procedure, we first evaluate initial estimates 
5..^ of (3.21) and 5^°) of (3.22). Equations (3.14), (3.15), 

(3B.9), and Lemma 3B.1 yield that the error in these approximations 
a.. ' and a.. v ' is 0(n ), provided that parallel conditions from 

Appendix 4E for 5^. and 5^ can be satisfied. 

To calculate the error in making (3.16) and (3.17) iterative
algorithms, we note from the form of (3.14) and (3.15) and from approx-
imations (3B.9), (3B.12), (3B.15), (3B.16), (3B.19), (3B.21), and
(3B.22) that the largest error for $\dot{\sigma}_{ih}^{(s)}$ will come from approximating
$\operatorname{var}(r_{iD} \mid z)$ and $\operatorname{cov}(r_{iD}, r_{hQ} \mid z)$. [That is, the error in approximating
terms multiplied by 1/(m+1) in (3.14) and (3.15) is 1/(m+1) times the
error for those terms and in total is less than the error made in
approximating $\operatorname{var}(r_{iD} \mid z)$ and $\operatorname{cov}(r_{iD}, r_{hQ} \mid z)$.] At the same time, note
from (3B.16) and (3B.22) that these errors are $O(n^{-3/2})$. Thus, if
parallel conditions from Appendix 4E are satisfied for $\dot{\sigma}_{ii}$ and $\dot{\sigma}_{ih}$,
then, recalling Lemma 3B.1, we have that the errors in approximating $\sigma_{ii}$
and $\sigma_{ih}$ by $\dot{\sigma}_{ii}^{(s)}$ and $\dot{\sigma}_{ih}^{(s)}$, respectively, are $O(n^{-3/2})$.

The second procedure to approximate the variance and covariance 
for those q variables referring to categories that have incomplete data 
is a method that is noniterative in 5^. Recall from (3.23) that, 
for both i and j referring to categories having incomplete data, $\sigma_{ij}$
coefficients of $\sigma_{lh}$, and a term that is not a function of $\sigma_{lh}$ for


any $l$ or $h$, in this procedure we write each $\sigma_{ij}$ as

$$\sigma_{ij} = \sum_{D \ni i}\sum_{Q \ni j}\Big[\sum_{l \in D}\sum_{h \in Q} a_{lh}\,\sigma_{lh}\Big] + b_{ij} + \delta_{ij}. \tag{4.22}$$

In (4.22) $\delta_{ij}$ is an infinite series containing terms in $E(e_i^{\,l} \mid z)$
and $E(e_i^{\,l} e_j^{\,h} \mid z)$ for $l, h \ge 2$ and $e_i = p_i - E(p_i \mid z)$. Thus, some of these terms are
in $\sigma_{lh}$ [for example, second-order terms in the approximation for
$E(r_{iD} \mid z)$ are terms in $\sigma_{lh}$]. Therefore, we can divide $\delta_{ij}$ into a
component $\delta_A$ containing terms in $\sigma_{lh}$ and a component $\delta_B$ containing the
remaining terms.

Doing so, we can write (4.22) as a linear system of q(q+1)/2
equations in the q(q+1)/2 unknowns $\sigma_{ii}$ and $\sigma_{ij}$:

$$[A + \delta_A]\,\sigma = B\,[I + \delta_B]\,1 \tag{4.23}$$

where $\sigma$ is the q(q+1)/2 x 1 vector of $\sigma_{ij}$ for i and j both referring to
categories having incomplete data, A is the q(q+1)/2 x q(q+1)/2 matrix
of the $a_{lh}$, B is the q(q+1)/2 x q(q+1)/2 matrix with $b_{ij}$ on the diagonal
and 0's elsewhere, I is the q(q+1)/2 x q(q+1)/2 identity matrix, $\delta_A$ is
the q(q+1)/2 x q(q+1)/2 matrix containing those terms in $\delta_{ij}$ that are
terms in $\sigma_{lh}$, $\delta_B$ is the q(q+1)/2 x q(q+1)/2 matrix containing zeros on the
off-diagonal and the remaining terms of $\delta_{ij}$ divided by $b_{ij}$ on the
diagonal, and 1 is the q(q+1)/2 x 1 vector containing all 1's.

Now, from (3B.16) and (3B.22), the terms $\operatorname{var}(r_{iD} \mid z)$ and
$\operatorname{cov}(r_{iD}, r_{hQ} \mid z)$ in (3.14) and (3.15) contain no terms in $\sigma_{ii}$ and $\sigma_{ij}$
that are not already included in A. The terms $E(r_{iD} \mid z)$ and $E(r_{iD} r_{hQ} \mid z)$
do, however; in particular, the first terms dropped from their Taylor-
series expansions. Since these terms have coefficients that are con-
stant with respect to the sample size n and since $E(r_{iD} \mid z)$ and
$E(r_{iD} r_{hQ} \mid z)$ in (3.14) and (3.15) are multiplied by $(m+1)^{-1} = O(n^{-1})$,
by Lemma 3B.1 the component of $\delta_{ij}$ that goes on the left-hand side of
(4.23) is $O(n^{-1}) \times \sigma_{ij}$. Thus, all terms in $\delta_A$ are $O(n^{-1})$.

1 J 

To determine $\delta_B$ we first note from (3.14) and (3.15) that we can
write (4.23) as

$$[A + \delta_A]\,\sigma = (m+1)^{-1} F\,[I + \delta_B]\,1 \tag{4.24}$$

for F=(m+1)B, because all terms in B come from those terms in (3.14)
and (3.15) that are multiplied by 1/(m+1). As discussed for the itera-
tive estimate, the largest error in approximating terms in (3.14) and
(3.15) comes from the $O(n^{-3/2})$ error in approximating $\operatorname{var}(r_{iD} \mid z)$ and
$\operatorname{cov}(r_{iD}, r_{hQ} \mid z)$. As discussed following (4.23), this error contains no
terms in $\sigma_{ij}$ and thus is in that part of $\delta_{ij}$ that belongs to $\delta_B$. Since
the diagonal terms of $\delta_B$ are terms from $\delta_{ij}$ divided by corresponding
diagonal terms of $B = (m+1)^{-1} F = O(n^{-1}) F$, we have that $\delta_{B,ii}
= O(n^{-3/2})[O(n) F_{ii}^{-1}] = O(n^{-1/2}) F_{ii}^{-1} = O(n^{-1/2})$. Recall that off-
diagonal elements of $\delta_B$ are 0.

Therefore, we can write (4.24) as 

(a,j ♦ Ofrf 1 )) 


(m+1) 



(4.25)

At this point, recalling from equations (3.16) and (3.17) the
form of A and F, we substitute $\dot{p}_i^{(s)}$ for $E(p_i \mid z)$ in A and F, where $\dot{p}_i^{(s)}$
again denotes the converged estimate from (3.11). We denote the
resulting matrices as $\dot{A}$ and $\dot{F}$, respectively. If the two conditions of
Appendix 4E are satisfied, so that $\dot{p}_i^{(s)} = E(p_i \mid z) + O(n^{-1})$ for $1 \le i \le k$, then,
from Lemma 3B.1, the error in approximating A and F by $\dot{A}$ and $\dot{F}$,
respectively, is $O(n^{-1})$.

In this case, (4.25) can be rewritten as 



(4.26) 


To solve for $\sigma$, we must invert the coefficient matrix of $\sigma$ in
(4.26), which matrix we assume to be nonsingular. To determine the
error made by approximating the result by $\dot{A}^{-1}$, we use the following
lemma:

Lemma 4.2: If $A$ and $\dot{A}$ are h-dimensional square matrices such that
$\dot{A} = A + O(q)$ and $A^{-1}$ and $\dot{A}^{-1}$ exist, then $\dot{A}^{-1} = A^{-1} + O(q)$.

Proof:

Define $A_{ij}$ and $\dot{A}_{ij}$ to be the cofactor of $a_{ij}$ and $\dot{a}_{ij}$, respectively.
Then, $\dot{a}_{ij} - a_{ij} = O(q)$ for all i and j implies that $\dot{A}_{ij} - A_{ij} = O(q)$. Thus,
from Lemma 3.2, $\dot{a}_{ij}\dot{A}_{ij} - a_{ij}A_{ij} = O(q)$ so that
$\det(\dot{A}) - \det(A) = \sum_{j=1}^{h} (\dot{a}_{ij}\dot{A}_{ij} - a_{ij}A_{ij}) = O(q)$.
Therefore, since a matrix inverse is the transposed matrix of
cofactors divided by the determinant, we have that, for $q \to 0$,

$$
\begin{aligned}
\dot{A}^{-1} &= [\det(\dot{A})]^{-1} (\dot{A}_{ij})' = [\det(A) + O(q)]^{-1} (A_{ij} + O(q))' \\
&= [\det(A)]^{-1} (A_{ij})' + O(q) \\
&= A^{-1} + O(q).
\end{aligned}
$$
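
The scaling claimed in Lemma 4.2 can be seen numerically on a small case. The sketch below is an illustration rather than part of the proof: it uses a hypothetical 2 x 2 matrix, perturbs every element by the same amount q, and measures the largest change in the inverse; all names are assumptions.

```python
def inverse_2x2(a):
    """Inverse of a 2 x 2 matrix via cofactors divided by the determinant."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("matrix is singular")
    return [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]


def max_abs_diff(a, b):
    """Largest absolute elementwise difference between two matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def inverse_error(a, q):
    """Change in the inverse when every element of `a` is shifted by q."""
    perturbed = [[x + q for x in row] for row in a]
    return max_abs_diff(inverse_2x2(perturbed), inverse_2x2(a))


A_EXAMPLE = [[2.0, 1.0], [1.0, 3.0]]
ratio_coarse = inverse_error(A_EXAMPLE, 1e-3) / 1e-3
ratio_fine = inverse_error(A_EXAMPLE, 1e-4) / 1e-4
```

The two ratios are nearly equal, so the error in the inverse shrinks in proportion to q, as the lemma states for nonsingular matrices.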

Thus, assuming that $A^{-1}$ and $\dot{A}^{-1}$ exist (i.e., their determinants are
not zero), solving for $\sigma$ in (4.26) and applying Lemmas 3B.1 and 4.2,
we have that




Therefore, for both i and j referring to categories having incomplete
data, the errors in approximating the vector of $\sigma_{ij}$ by the procedure
that is noniterative in $\sigma_{ij}$ are, like those of the iterative procedure,
of order $O(n^{-3/2})$.

4.4 Summary:

In the first part of this chapter, we proved that the posterior 

central cross-product moments satisfy the conditions for a first-order 

Taylor-series expansion to be an accurate approximation of the exact 

posterior mean. We also proved that asymptotically the Taylor-series 

approximate posterior mean equals the exact posterior mean. 

In the second part of the chapter, we studied how fast the 

Taylor-series approximate posterior mean approaches the limit, the 

exact posterior mean, and then investigated the accuracy of the 

Taylor-series approximate posterior variance and covariance. We began 

by showing that the Taylor-series expansions for elements of the exact 

posterior mean and covariance matrices are accurate to order C^n* 1 ) and 
-3/2 

0 (n ' ), respectively. However, because the exact posterior moments in 
these expansions are then approximated, the errors in the final approxi- 
mations, which we called the Taylor-series approximations, are not 

-1 -3/2 

necessarily of magnitude 0(n ) and 0(n ), respectively. 

Nearly always, the Taylor-series approximate posterior mean will be evaluated iteratively. For this type of evaluation, we gave two sufficient conditions guaranteeing accuracy of the Taylor-series approximate posterior mean to the exact posterior mean within order of magnitude $O(n^{-1})$. If there exists a neighborhood $\|p - \bar{p}\|_\infty < \rho$, for $\rho > 0$, of $\bar{p}$ such that for all probabilities $p$ in this neighborhood

\[ \max_{1 \le i \le k} \sum_{j=1}^{k} |\partial g_i(p)/\partial p_j| \le \lambda < 1, \]

for

\[ g_i(p) = \Big( z_i + \nu_i + \sum_{a \ni i} z_a p_i / p_a \Big) \Big/ \Big( n + \sum_{h=1}^{k+1} \nu_h \Big), \]

and if an initial iterative estimate $p^{(0)}$ is chosen within the inner neighborhood $\|p - \bar{p}\|_\infty \le \rho_0 = \rho - \delta/(1-\lambda)$, where $\delta$ is a bound on the error in approximating the exact posterior mean by a first-order Taylor-series expansion, then the iterative solution to the defining equation of the Taylor-series approximate posterior mean will converge to within $O(n^{-1})$ of the exact posterior mean. We also showed how to determine, in practice, whether these conditions can be expected to hold.
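The iterative evaluation described above can be sketched as a plain fixed-point iteration. This is an illustrative sketch under stated assumptions, not the thesis's own program: it assumes the update has the form $g_i(p) = (z_i + \nu_i + \sum_{a \ni i} z_a p_i/p_a)/(n + \sum_h \nu_h)$, where $z_i$ counts fully classified observations in category i, $z_a$ counts observations known only to lie in the set of categories $a$, and $p_a$ is the sum of $p_j$ over $a$. The data, the prior parameters, and the function names are hypothetical.

```python
def g(p, full, partial, nu):
    """One sweep of the assumed fixed-point map p -> g(p)."""
    n = sum(full) + sum(partial.values())
    denom = n + sum(nu)
    updated = []
    for i in range(len(p)):
        # Share each partially classified count z_a among the categories in a,
        # in proportion to the current probabilities.
        shared = sum(z_a * p[i] / sum(p[j] for j in a)
                     for a, z_a in partial.items() if i in a)
        updated.append((full[i] + nu[i] + shared) / denom)
    return updated


def iterate(p0, full, partial, nu, tol=1e-13, max_iter=500):
    """Repeat p <- g(p) until successive iterates agree to within tol."""
    p = list(p0)
    for _ in range(max_iter):
        new_p = g(p, full, partial, nu)
        if max(abs(u - v) for u, v in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Hypothetical trinomial data: 25 fully classified observations and 5
# observations known only to fall in category 0 or category 1.
full = [10, 8, 7]
partial = {(0, 1): 5}
nu = [1.0, 1.0, 1.0]
p_star = iterate([1 / 3, 1 / 3, 1 / 3], full, partial, nu)
print(p_star)
```

In this example category 2 has only complete data, so its component is fixed after the first sweep at (7+1)/33, while the other two components converge geometrically.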

Further, for incomplete trinomial data, we showed that there does exist a root in the probability simplex of the defining equations for the Taylor-series approximate posterior mean that differs from the exact posterior mean by magnitude $O(n^{-1})$. We then investigated when the Taylor-series approximation for incomplete data from the general k-dimensional multinomial distribution has a solution that differs from the exact posterior mean by magnitude $O(n^{-1})$. Because the Taylor-series approximate posterior mean can be written as a posterior mode, it always has at least one solution in the probability simplex if the eigenvalues of the covariance matrix of the complete-data sufficient statistics are bounded above zero. However, none of these solutions may be in the convergence region for the exact posterior mean ("epm convergence region"). In particular, for k>2, there may not exist an epm convergence region, and we gave an example of such a case. In this example, many components of p were very small. Since results of Chapters 6 and 7 indicate that the posterior mean is a poor estimator to use to minimize risk for quadratic loss when components of p are very small, the posterior mode being much better, absence of an epm convergence region was considered unimportant for this particular case (because the posterior mean would not be calculated).

When there does exist an epm convergence region, there can be trouble finding it, because there may be multiple roots in the probability simplex of the defining equations for the Taylor-series approximate posterior mean. The problem then is choosing among these roots. We showed how to find one choice, the global maximum. Although it was not proved, from the complete-data relationship between the posterior mode and posterior mean, we intuitively expect the global maximum to be in the epm convergence region, or at least to be the closest root to the exact posterior mean.

We also noted that the two guaranteed-convergence conditions, 
conditions given by Lemma 4E.1 on the initial iterative estimate and on 
the partial derivatives of the posterior mean, are sufficient but not 
necessary. We gave two illustrations in Appendix 4E where these conditions
were not met but the iterates correctly converged. Further, as
also illustrated, the error bound given by Lemma 4E.1 is extremely
conservative.

Finally, for those categories having only complete data, there is 
no error in using the Taylor-series approximation for the exact 
posterior mean. 

Recall that elements of the Taylor-series approximate posterior covariance matrix can be evaluated by procedures that are noniterative or iterative in elements of the posterior covariance matrix. The Taylor-series approximate posterior mean is used in both procedures. When the error in the Taylor-series approximate posterior mean is $O(n^{-1})$, then the noniterative procedure yields approximations for elements of the posterior covariance matrix that are accurate to order $O(n^{-3/2})$. If, in addition to the $O(n^{-1})$ accuracy in the Taylor-series approximate posterior mean, parallel conditions given in Appendix 4E are met for the Taylor-series approximate covariances, then the iterative procedure also gives approximations for elements of the posterior covariance matrix that are accurate to order $O(n^{-3/2})$. Under these same conditions, when one of categories i and j has no incomplete data, then the error in the Taylor-series approximate variance and covariance is $O(n^{-2})$. For both i and j having only complete data, there is no error in approximating the exact posterior variance and covariance by the Taylor-series approximate posterior variance and covariance, respectively.
APPENDIX 4A

POSTERIOR CENTRAL MOMENTS GIVEN COMPLETE DATA


4A.1 Introduction:

In this appendix, we determine orders of magnitude for the posterior central moments. To do so, we prove by induction an expression for the lowest-order term in $(n + \sum_{h=1}^{k+1} \nu_h)^{-1}$ of the $l$th posterior central moment $E[(p_i - \mu_i)^l \mid x]$. We first determine, in Section 4A.3, the expression for the first twenty-one central moments, enough moments to determine an algebraic pattern. Then, in Section 4A.4, we extend moment results from Kendall and Stuart (1969, v1, p148-150) for Pearson distributions, proving that if the expression is true for any two successive values of $l$, it must also be true for the next higher value of $l$.


We conclude Appendix 4A in Section 4A.5 by generalizing this method
to cross-product moments. Order-of-magnitude results are given for
forty-nine cases. Because of the variety and complexity of possible
results for the lowest-order term in $(n + \sum_h \nu_h)^{-1}$, we do not further use
this method. Instead, we describe a different approach in Section
4.2.1 of the main text. Although the different approach gives orders
of magnitude for even cross-product moments, it gives only bounds for
odd cross-product moments. Hence, results of Section 4A.5 are especially
important for odd cross-product moments.

In Section 4A.2 we describe a symbolic computer system used to 
facilitate algebraic operations in the last three sections. 

Remark: The usual procedure to calculate moments is through the characteristic (moment-generating) function, cumulants, or factorial moments. However, in this case, calculation of the posterior central moments (2.6) was most easily done directly. As might be expected in such a case, none of the three usual procedures aided in obtaining the limit of these moments. The moment-generating function led directly to expression (2.6) for the $l$th posterior central moment; that is, differentiating $\exp(-t'\mu)\phi(t)$ with respect to $t$, for $\phi(t)$ the moment-generating function, and setting $t$ to 0 in the results gives (2.6). Thus, use of the moment-generating function was not helpful in reexpressing (2.6) to obtain its limit. Calculation of the logarithm of the moment-generating function to obtain the cumulants (for the purpose of translation back to the central moments) also did not aid in obtaining the limit of the $l$th central moment (2.6). Consideration of factorial moments, often useful for discrete distributions, was unfruitful for this continuous distribution.
4A.2 Symbolic Computer System:

In this section we describe a computer system used to facilitate algebraic operations in the remaining three sections. In Section 4A.3 we use this system to expand the first twenty-one central moments, $E[(p_i - \mu_i)^l \mid x]$ for $1 \le l \le 21$, in a Taylor series in $(n + \sum_h \nu_h)^{-1}$ about the point $(n + \sum_h \nu_h)^{-1} = 0$. In Section 4A.4 we use the computer system to algebraically solve, in terms of $(n + \sum_h \nu_h)^{-1}$ and $\mu_i = E(p_i \mid x)$, a system of four equations in four unknowns to enable, for all $l$, the $(l+1)$st central moment to be written in terms of the two preceding, $l$th and $(l-1)$st, moments. In Section 4A.5, the computer system facilitates evaluation of cross-product moments $E[(p_i - \mu_i)^l (p_j - \mu_j)^h \mid x]$ for $2 \le l, h \le 8$.

The computer system used is MACSYMA* (Project MAC's SYmbolic MAnipulation System), developed by the Mathlab Group, Project MAC at M.I.T. (Massachusetts Institute of Technology). MACSYMA is a versatile interactive computer system for manipulating algebraic or symbolic expressions as well as for performing high-precision numerical calculations. MACSYMA is written in LISP (a list-processing programming language used for nonnumerical applications) for a Digital Equipment Corporation PDP-10 computer with a KL10 processor and 500k 36-bit words of memory. The PDP-10 computer is located at the Laboratory for Computer Science at M.I.T. and is known as the MC (MACSYMA CONSORTIUM) computer. A large variety of computer terminals at NASA, Langley Research Center, allow access to MACSYMA.

*This work is supported by the Defense Advanced Research Projects Agency work order 2095, under Office of Naval Research Contract #N00014-75-C-0661.

MACSYMA can algebraically differentiate and integrate analytic expressions, take limits, solve systems of linear or polynomial equations, expand functions in Taylor series, manipulate matrices and tensors, factor complicated polynomials in many variables, plot functions, and calculate Laplace transforms. The system has "built-in knowledge" of many commonly used mathematical functions. Operations are done in rational, not floating-point, arithmetic. Thus, round-off error does not exist. Additional information can be found in MACSYMA manuals (1975a, 1975b, 1976) by the Mathlab Group at M.I.T.
4A.3 Derivation of General Expression:

In this section we determine an expression for the lowest-order term in $(n + \sum_{h=1}^{k+1} \nu_h)^{-1}$ of the first twenty-one posterior central moments. We do so by writing the $l$th central moment [recall (2.6)]

\[ E[(p_i - \mu_i)^l \mid x] = \sum_{j=0}^{l} (-1)^{l-j} \binom{l}{j} \left( \frac{x_i + \nu_i}{n + \sum_h \nu_h} \right)^{l-j} \prod_{q=0}^{j-1} \frac{x_i + \nu_i + q}{n + \sum_h \nu_h + q} \tag{4A.1} \]

in a Taylor series in $(n + \sum_{h=1}^{k+1} \nu_h)^{-1}$ about the point $(n + \sum_{h=1}^{k+1} \nu_h)^{-1} = 0$.

Recall from (2.2) that

\[ \mu_i = (x_i + \nu_i) \Big/ \Big( n + \sum_{h=1}^{k+1} \nu_h \Big) \tag{4A.2} \]

and let

\[ l!! = l(l-2)(l-4)(l-6) \cdots 1 \quad \text{for } l \text{ odd*}, \tag{4A.3} \]

\[ r = \Big( n + \sum_{h=1}^{k+1} \nu_h \Big)^{-1}, \tag{4A.4} \]

\[ s_i = \Big[ n + \sum_{h=1}^{k+1} \nu_h - (x_i + \nu_i) \Big] \Big/ (x_i + \nu_i) = (1 - \mu_i)/\mu_i, \tag{4A.5} \]

and

\[ y_i = p_i - \mu_i. \tag{4A.6} \]

*Standard mathematical notation. For example, see Gradshteyn and Ryzhik (1967, p xliii). Note that $l!!$ is not defined for $l$ even.

Then

\[ s_i + 1 = 1/\mu_i, \tag{4A.7} \]

\[ s_i^2 - 1 = (1 - 2\mu_i)/\mu_i^2, \tag{4A.8} \]

\[ s_i/(s_i + 1)^2 = \mu_i (1 - \mu_i), \tag{4A.9} \]

and the variance is given by

\[ \sigma_{ii} = r s_i / [(1+r)(s_i+1)^2] = r \mu_i (1 - \mu_i)/(1+r) \tag{4A.10} \]

since

\[ r/(1+r) = 1 \Big/ \Big( n + \sum_h \nu_h + 1 \Big). \tag{4A.11} \]

Rewrite the $l$th central moment (4A.1) as

\[ E(y_i^l \mid x) = \mu_i^l \sum_{j=0}^{l} (-1)^{l-j} \binom{l}{j} \prod_{q=0}^{j-1} \frac{1 + q/(x_i + \nu_i)}{1 + q/(n + \sum_h \nu_h)} \tag{4A.12} \]

since

\[ \frac{1 + q/(x_i + \nu_i)}{1 + q/(n + \sum_h \nu_h)} = 1 + \frac{q[-1/(n + \sum_h \nu_h) + 1/(x_i + \nu_i)]}{1 + q/(n + \sum_h \nu_h)} = 1 + \frac{(n + \sum_h \nu_h) - (x_i + \nu_i)}{x_i + \nu_i} \cdot \frac{q/(n + \sum_h \nu_h)}{1 + q/(n + \sum_h \nu_h)} = 1 + s_i q \frac{r}{1 + qr}. \tag{4A.13} \]

Define

\[ f(r) = r/(1 + qr). \tag{4A.14} \]
Expanding $f(r)$ in a Taylor series in $r$ about the point $r=0$ yields that

\[ r/(1 + qr) = \sum_{j=1}^{\infty} (-1)^{j-1} q^{j-1} r^j. \tag{4A.15} \]

Substituting (4A.15) into (4A.12) and, by using MACSYMA, expanding $\mu_i^{-l} E(y_i^l \mid x)$ in a Taylor series in $r$ about the point $r=0$ yields the low-order terms given in Table 4A.1 for the first twenty-one central moments. Note that all results must be multiplied by $\mu_i^l$. To get the lowest-order term in $r$, we discard all those terms in the innermost set of parentheses, except for cases $l=2$ and $l=3$. The following pattern is detected for $1 \le l \le 21$:

\[ E(y_i^l \mid x) \doteq \begin{cases} (l-1)!!\, s_i^{l/2} r^{l/2} \mu_i^l & \text{for } l \text{ even} \\ (l-1)\, l!!\, (s_i - 1)\, s_i^{(l-1)/2} r^{(l+1)/2} \mu_i^l / 3 & \text{for } l \text{ odd,} \end{cases} \tag{4A.16} \]

where the approximation "$\doteq$" in (4A.16) means that only the lowest-order term in $r$ is given. As a check on these formulas, note that for $l=20$ and $l=21$, (4A.16) yields $19!!\, s_i^{10} r^{10} \mu_i^{20}$ and $(20 \times 21!!/3)(s_i - 1) s_i^{10} r^{11} \mu_i^{21}$, respectively, both of which agree with results in Table 4A.1.

To simplify results we multiply numerator and denominator of (4A.16) by $(1+r)^{l/2}$ for $l$ even and by $(1+r)^{(l+1)/2}$ for $l$ odd. Then, substituting into (4A.16) from (4A.2) - (4A.10) and again giving only the lowest-order term in $r = 1/(n + \sum_h \nu_h)$ yields that, for $1 \le l \le 21$,

\[ E(y_i^l \mid x) \doteq \begin{cases} (l-1)!!\, \sigma_{ii}^{l/2} & \text{for } l \text{ even} \\ (l-1)\, l!!\, (1 - 2\mu_i)\, \sigma_{ii}^{(l+1)/2} / [3 \mu_i (1 - \mu_i)] & \text{for } l \text{ odd.} \end{cases} \tag{4A.17} \]
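The pattern (4A.16) can be spot-checked without a symbolic system. The sketch below is an illustration rather than the MACSYMA computation used in the thesis: it evaluates the exact central moment from the binomial expansion (4A.1) in rational arithmetic and compares it with the lowest-order term of (4A.16). The function names and the test values (a posterior beta with first parameter 300000 and total $n + \sum_h \nu_h = 10^6$) are hypothetical.

```python
from fractions import Fraction
from math import comb


def central_moment(l, a, N):
    """Exact l-th central moment of p_i when x_i + nu_i = a and n + sum(nu) = N."""
    mu = Fraction(a, N)
    total = Fraction(0)
    for j in range(l + 1):
        raw = Fraction(1)  # E(p_i^j | x) as the product in (4A.1)
        for q in range(j):
            raw *= Fraction(a + q, N + q)
        total += (-1) ** (l - j) * comb(l, j) * mu ** (l - j) * raw
    return total


def double_factorial(m):
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def leading_term(l, a, N):
    """Lowest-order term in r from the even/odd pattern (4A.16)."""
    mu, r = Fraction(a, N), Fraction(1, N)
    s = (1 - mu) / mu
    if l % 2 == 0:
        return double_factorial(l - 1) * s ** (l // 2) * r ** (l // 2) * mu ** l
    return (Fraction(l - 1, 3) * double_factorial(l) * (s - 1)
            * s ** ((l - 1) // 2) * r ** ((l + 1) // 2) * mu ** l)


a, N = 300000, 10 ** 6
ratios = {l: float(central_moment(l, a, N) / leading_term(l, a, N)) for l in range(2, 8)}
print(ratios)
```

Each ratio of the exact moment to its lowest-order term is close to 1 for large $n + \sum_h \nu_h$, with a relative discrepancy of order $r$, as the next-order terms in Table 4A.1 suggest.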
TABLE 4A.1

LOW-ORDER TERMS* FOR FIRST 21 CENTRAL MOMENTS $E[(p_i - \mu_i)^l \mid x]$

 l    low-order term

 2    $-r s_i (r - 1)$
 4    $3 r^2 s_i (r(2 s_i^2 - 8 s_i + 2) + s_i)$
 6    $5 r^3 s_i^2 (r(26 s_i^2 - 79 s_i + 26) + 3 s_i)$
 8    $35 r^4 s_i^3 (r(68 s_i^2 - 184 s_i + 68) + 3 s_i)$
10    $315 r^5 s_i^4 (r(140 s_i^2 - 355 s_i + 140) + 3 s_i)$
12    $3465 r^6 s_i^5 (r(250 s_i^2 - 608 s_i + 250) + 3 s_i)$
14    $45045 r^7 s_i^6 (r(406 s_i^2 - 959 s_i + 406) + 3 s_i)$
16    $675675 r^8 s_i^7 (r(616 s_i^2 - 1424 s_i + 616) + 3 s_i)$
18    $34459425 r^9 s_i^8 (r(296 s_i^2 - 673 s_i + 296) + s_i)$
20    $654729075 r^{10} s_i^9 (r(410 s_i^2 - 920 s_i + 410) + s_i)$

 3    $-2 r^2 (s_i - 1) s_i (3r - 1)$
 5    $4 r^3 (s_i - 1) s_i (r(6 s_i^2 - 50 s_i + 6) + 5 s_i)$
 7    $42 r^4 (s_i - 1) s_i^2 (r(22 s_i^2 - 115 s_i + 22) + 5 s_i)$
 9    $56 r^5 (s_i - 1) s_i^3 (r(472 s_i^2 - 1970 s_i + 472) + 45 s_i)$
11    $770 r^6 (s_i - 1) s_i^4 (r(916 s_i^2 - 3335 s_i + 916) + 45 s_i)$
13    $60060 r^7 (s_i - 1) s_i^5 (r(314 s_i^2 - 1042 s_i + 314) + 9 s_i)$
15    $210210 r^8 (s_i - 1) s_i^6 (r(2474 s_i^2 - 7675 s_i + 2474) + 45 s_i)$
17    $4084080 r^9 (s_i - 1) s_i^7 (r(3668 s_i^2 - 10810 s_i + 3668) + 45 s_i)$
19    $87297210 r^{10} (s_i - 1) s_i^8 (r(5192 s_i^2 - 14695 s_i + 5192) + 45 s_i)$
21    $6110804700 r^{11} (s_i - 1) s_i^9 (r(2362 s_i^2 - 6470 s_i + 2362) + 15 s_i)$

* All results must be multiplied by $\mu_i^l$.
4A.4 Validity of General Expression:

In the last section we derived expression (4A.17) for the $l$th posterior moment for $1 \le l \le 21$, hence proving the expression true for these values of $l$. In this section, we prove that if the expression is true for any two successive values of $l$, it must also be true for the next higher value of $l$. Having done so, we will have proved that expression (4A.17) holds for all positive integer values of $l$.

In (4A.1) we are calculating the $l$th posterior central moment of $p_i$. Since the posterior distribution of $p$ given $x$ is the k-dimensional Dirichlet $D(x_1 + \nu_1, \ldots, x_k + \nu_k; x_{k+1} + \nu_{k+1})$, the marginal posterior distribution of $p_i$ for $1 \le i \le k$ is the one-dimensional Dirichlet $D(x_i + \nu_i; \sum_{j \ne i}^{k+1} (x_j + \nu_j))$ or beta $Be(x_i + \nu_i, \sum_{j \ne i}^{k+1} (x_j + \nu_j))$. [See Wilks (1963, p173-179).] That is, the posterior density of $p_i$ given $x$ is

\[ f(p_i \mid x) = \frac{\Gamma[\sum_{j=1}^{k+1} (x_j + \nu_j)]}{\Gamma(x_i + \nu_i)\, \Gamma[\sum_{j \ne i}^{k+1} (x_j + \nu_j)]}\; p_i^{x_i + \nu_i - 1} (1 - p_i)^{\sum_{j \ne i}^{k+1} (x_j + \nu_j) - 1}. \tag{4A.18} \]

Now, the beta distribution is known as one of the Pearson distributions [Kendall and Stuart (1969, v1, p148)]. A Pearson distribution is defined as any frequency function $f(w)$ for which

\[ df(w)/dw = (w - a) f(w) / (b_0 + b_1 w + b_2 w^2) \tag{4A.19} \]

for some $a$, $b_0$, $b_1$, and $b_2$. Kendall and Stuart derive the general moment for a Pearson distribution in terms of lower-order moments. We now generalize their method to the case of central moments. Note that we treat the most common case, $f(0) = f(1) = 0$. However, results also hold when one or both of $f(0)$ and $f(1)$ are not zero. Thus, results also hold for J-shaped, U-shaped, and flat beta distributions. [See also Kendall and Stuart (1969, v1, p151).]

Therefore, cross multiplying (4A.19), adding and subtracting powers of $\eta = E(w)$, and multiplying both resulting sides by $(w - \eta)^l$ yields that

\[ \int (w - \eta)^l [(b_0 + \eta b_1 + \eta^2 b_2) + (b_1 + 2\eta b_2)(w - \eta) + b_2 (w - \eta)^2]\, \frac{df(w)}{dw}\, dw = \int (w - \eta)^l [(w - \eta) + (\eta - a)]\, f(w)\, dw. \tag{4A.20} \]

Integrating the left-hand side of (4A.20) by parts over the range of the distribution, we find, assuming that the integrals exist, that

\[ \Big[ (w - \eta)^l \{ [(b_0 + \eta b_1 + \eta^2 b_2) + (b_1 + 2\eta b_2)(w - \eta) + b_2 (w - \eta)^2]\, f(w) \} \Big]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} f(w) \{ l (b_0 + \eta b_1 + \eta^2 b_2)(w - \eta)^{l-1} + (l+1)(b_1 - 2\eta b_2)(w - \eta)^l + (l+2) b_2 (w - \eta)^{l+1} \}\, dw \]
\[ = \int_{-\infty}^{\infty} (w - \eta)^{l+1} f(w)\, dw + (\eta - a) \int_{-\infty}^{\infty} (w - \eta)^l f(w)\, dw. \tag{4A.21} \]

For the beta density (4A.18), $f(p_i \mid x)$ is positive for $0 \le p_i \le 1$. Thus, in equation (4A.21) we replace endpoints $-\infty$ and $+\infty$ by 0 and 1, respectively, and $w$, $\eta$, and $f(w)$ by $p_i \mid x$, $\mu_i$, and $f(p_i \mid x)$, respectively. We then note that, since $f(1 \mid x) = f(0 \mid x) = 0$ and, for any positive integer $j$, $\lim_{p_i \to 1} p_i^j = 1$ and $\lim_{p_i \to 0} p_i^j = 0$,

\[ \lim_{p_i \to 1} p_i^j f(p_i \mid x) = \lim_{p_i \to 0} p_i^j f(p_i \mid x) = 0. \tag{4A.22} \]

Therefore, the first term in equation (4A.21) is 0.
Hence, recalling from Section 4A.3 the definition $y_i = p_i - \mu_i$, we can write (4A.21) as

\[ l [b_0 + b_1 \mu_i + \mu_i^2 b_2]\, E(y_i^{l-1} \mid x) + [(l+1)(b_1 - 2\mu_i b_2) + \mu_i - a]\, E(y_i^l \mid x) + [(l+2) b_2 + 1]\, E(y_i^{l+1} \mid x) = 0. \tag{4A.23} \]

Thus, if we knew $b_0$, $b_1$, $b_2$, and $a$, we could use (4A.23) to calculate any $(l+1)$st central moment in terms of the $l$th and $(l-1)$st central moments. To calculate $b_0$, $b_1$, $b_2$, and $a$, we successively let $l = 0$, 1, 2, and 3 in (4A.23), substitute results from Section 4A.3 for $E(y_i^j \mid x)$ for $2 \le j \le 4$, and set $E(y_i^{-1} \mid x) = 0$, $E(y_i^0 \mid x) = 1$, and $E(y_i^1 \mid x) = 0$ to obtain four equations in four unknowns $b_0$, $b_1$, $b_2$, and $a$. Solving these four equations with MACSYMA yields

\[ b_0 = 4r / [(2r-1)(s_i+1)^2], \]
\[ b_1 = r(s_i - 3) / [(2r-1)(s_i+1)], \]
\[ b_2 = -r/(2r-1), \tag{4A.24} \]

and

\[ a = (r s_i + r - 1) / [(2r-1)(s_i+1)]. \]


Substituting results (4A.24) into equation (4A.23) and collecting terms in $E(y_i^{l-1} \mid x)$, $E(y_i^l \mid x)$, and $E(y_i^{l+1} \mid x)$ yields that

\[ E(y_i^{l+1} \mid x) = l r\, [(s_i^2 - 1)\, E(y_i^l \mid x) + s_i\, E(y_i^{l-1} \mid x)] \big/ [(1 + lr)(s_i + 1)^2]. \tag{4A.25} \]
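The recursion (4A.25) holds exactly for the beta posterior, which makes it easy to verify in rational arithmetic. The following sketch is an illustrative check with hypothetical names and test values: it computes central moments directly from the binomial expansion (4A.1) and confirms that the $(l+1)$st moment produced by (4A.25) from the two preceding moments agrees exactly.

```python
from fractions import Fraction
from math import comb


def central_moment(l, a, N):
    """Exact l-th central moment of p_i when x_i + nu_i = a and n + sum(nu) = N."""
    mu = Fraction(a, N)
    total = Fraction(0)
    for j in range(l + 1):
        raw = Fraction(1)
        for q in range(j):
            raw *= Fraction(a + q, N + q)
        total += (-1) ** (l - j) * comb(l, j) * mu ** (l - j) * raw
    return total


def next_moment(l, a, N):
    """(l+1)-st central moment from the l-th and (l-1)-st via recursion (4A.25)."""
    r = Fraction(1, N)
    s = Fraction(N - a, a)
    m_l = central_moment(l, a, N)
    m_prev = central_moment(l - 1, a, N)
    return l * r * ((s * s - 1) * m_l + s * m_prev) / ((1 + l * r) * (s + 1) ** 2)


# Hypothetical posterior: x_i + nu_i = 7 and n + sum(nu) = 20.
checks = [next_moment(l, 7, 20) == central_moment(l + 1, 7, 20) for l in range(1, 10)]
print(checks)
```

Because the arithmetic is exact, agreement here is equality of fractions rather than agreement to within round-off.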


Therefore, we can use (4A.25) to show that if expression (4A.17) holds for the $l$th and $(l-1)$st central moments, it must also hold for the $(l+1)$st moment. Because we have different expressions for $l$ even and $l$ odd, we have two cases. By using (4A.17), (4A.25), and (4A.2) - (4A.10), we have that, to the lowest-order term in $r$:

for $l+1$ even:

\[ E(y_i^{l+1} \mid x) \doteq l r\, [(s_i^2 - 1)(l-1)\, l!!\, (1 - 2\mu_i)\, (r/3)\, \sigma_{ii}^{(l-1)/2} (1+r)^{(l-1)/2} + s_i (l-2)!!\, \sigma_{ii}^{(l-1)/2} (1+r)^{(l-1)/2}] \big/ [(lr + 1)(s_i + 1)^2] \]
\[ = l!!\, \sigma_{ii}^{(l-1)/2} (1+r)^{(l-1)/2} \{ l(l-1) r^2 (1 - 2\mu_i)^2/3 + r \mu_i (1 - \mu_i) \} / (1 + lr) \tag{4A.26} \]
\[ = l!!\, \sigma_{ii}^{(l+1)/2} (1+r)^{(l+1)/2} \big( 1 + \{ l(l-1)(1 - 2\mu_i)^2 / [3 \mu_i (1 - \mu_i)] \} r \big) / (1 + lr) \]
\[ \doteq l!!\, \sigma_{ii}^{(l+1)/2}\,{}^{*}; \]

for $l+1$ odd:

\[ E(y_i^{l+1} \mid x) \doteq l r\, [(s_i^2 - 1)(l-1)!!\, \sigma_{ii}^{l/2} (1+r)^{l/2} + s_i (l-2)(l-1)!!\, (1 - 2\mu_i)\, (r/3)\, \sigma_{ii}^{(l-2)/2} (1+r)^{(l-2)/2}] \big/ [(lr + 1)(s_i + 1)^2] \]
\[ = l (l-1)!!\, \sigma_{ii}^{l/2} (1+r)^{l/2} [r (1 - 2\mu_i) + (l-2)(1 - 2\mu_i) r/3] / (1 + lr) \tag{4A.27} \]
\[ = l (l-1)!!\, \sigma_{ii}^{l/2} (1+r)^{l/2} (1 - 2\mu_i)\, (r/3)\, (3 + l - 2) / (1 + lr) \]
\[ \doteq l (l+1)!!\, \sigma_{ii}^{(l+2)/2} (1 - 2\mu_i) / [3 \mu_i (1 - \mu_i)]\,{}^{*}. \]

Therefore, from results of Sections 4A.3 and 4A.4, expression (4A.17) is true for all positive integer values of $l$.

*These expressions are actually divided by a finite constant $c(l)$ where

\[ c(l) = \begin{cases} 1 & \text{if } l < n + \sum_h \nu_h \\ 2 & \text{if } l = n + \sum_h \nu_h \\ > 2 & \text{if } l > n + \sum_h \nu_h. \end{cases} \]

The constants $c(l)$ arise from evaluation of the term $\xi = (1+r)^{(l+1)/2}/(1 + lr)$. Since $r = 1/(n + \sum_h \nu_h)$, then $r < 1$. Hence, the numerator of $\xi$ can be accurately approximated by the first two terms of the series expansion

\[ \sum_{j=0}^{\infty} \binom{(l+1)/2}{j} r^j = 1 + [(l+1)/2]\, r + [(l+1)/2][(l+1)/2 - 1]\, r^2/2! + \cdots. \]

When $lr < 1$ (i.e., $l < n + \sum_h \nu_h$), then the term $1/(1 + lr)$ in $\xi$ can also be accurately approximated by the first two terms $1 - lr$ of a series expansion $1 - lr + (lr)^2 - (lr)^3 + \cdots$. In this case, $\xi$ can be accurately approximated by $1 + [(l+1)/2 - l]\, r$, the low-order term in $r$ resulting from the multiplication of the two series. Therefore, in this case of $lr < 1$, expressions (4A.26) and (4A.27) are correct as given. When $lr = 1$ (i.e., $l = n + \sum_h \nu_h$), however, then $\xi = (1+r)^{(l+1)/2}/2 \doteq \{1 + [(l+1)/2]\, r\}/2$ and expressions (4A.26) and (4A.27) must be divided by 2. When $lr > 1$ (i.e., $l > n + \sum_h \nu_h$), then $\xi < (1+r)^{(l+1)/2}/2$ so that expressions (4A.26) and (4A.27) must be divided by some constant larger than 2.

However, the interest in Chapter 4 is in very large $n$; in particular, the limiting case $n \to \infty$. For these cases, $l < n < n + \sum_h \nu_h$ and $c(l) = 1$. Therefore, to avoid carrying around a term that is 1 in the cases in which we are interested, we do not include it. Further, the limit taken in (4.3) in the main text is not affected by $c(l)$.


4A.5 Cross-product Moments

The method of the preceding sections readily extends to cross-product central moments $E(\prod_{g=1}^{l} y_g \mid x)$. We can write these cross-product moments as a nest of expressions where each expression is similar in form to the $l$th central moment (4A.12) with the exception that each term of the sum is multiplied by results of inner nests.

For example, for $1 \le i, j \le k$; $j \ne i$; $l$, $h$ positive integers; and, again, $\mu_i = (x_i + \nu_i)/(n + \sum \nu_h)$ and $y_i = p_i - \mu_i$; we have that

\[ E(y_i^l y_j^h \mid x) = E\Big\{ \sum_{a=0}^{l} (-1)^{l-a} \binom{l}{a} \mu_i^{l-a} p_i^a \sum_{b=0}^{h} (-1)^{h-b} \binom{h}{b} \mu_j^{h-b} p_j^b \,\Big|\, x \Big\} \]
\[ = \mu_i^l \mu_j^h \sum_{a=0}^{l} \sum_{b=0}^{h} (-1)^{l-a} (-1)^{h-b} \binom{l}{a} \binom{h}{b} \left[ \frac{x_i + \nu_i}{n + \sum \nu_h} \right]^{-a} \left[ \frac{x_j + \nu_j}{n + \sum \nu_h} \right]^{-b} \frac{\prod_{q_i=0}^{a-1} (x_i + \nu_i + q_i) \prod_{q_j=0}^{b-1} (x_j + \nu_j + q_j)}{\prod_{q_h=0}^{a+b-1} (n + \sum \nu_h + q_h)} \tag{4A.28} \]

since

\[ E(p_i^a p_j^b \mid x) = \frac{\Gamma(x_i + \nu_i + a)\, \Gamma(x_j + \nu_j + b)\, \Gamma(n + \sum \nu_h)}{\Gamma(x_i + \nu_i)\, \Gamma(x_j + \nu_j)\, \Gamma(n + \sum \nu_h + a + b)}. \tag{4A.29} \]

In (4A.28) we again use the convention that $\prod_{q=0}^{-1} f(q) = 1$ for any function $f$.
where, again,

\[ r = \Big( n + \sum \nu_h \Big)^{-1} \]

and

\[ s_j = \Big[ \Big( n + \sum \nu_h \Big) - (x_j + \nu_j) \Big] \Big/ (x_j + \nu_j) = (1 - \mu_j)/\mu_j. \]


(4A.30) 


The first term of the last line of (4A.30) was derived in (4A.13) and 
the second term has a similar derivation. 

Therefore, we can write (4A.28) as 



Using MACSYMA, we can evaluate the (4A.31) factor in braces for enough values of $h$ to establish a pattern for the low-order term in $r$ for $h$ even and odd. We can then use the method of Section 4A.4 to show that this pattern is valid for all values of $h$. The procedure can then be repeated for the remaining factor of (4A.31).

Because we will have cross multiplication between the two factors of (4A.31), however, we must know not only the lowest-order term in $r$ but also the next lowest-order term in $r$. In general, for each additional variable $y_m$ in (4A.31) we must know an additional low-order term in $r$.

Further, the two lowest-order terms in $r$ for the (4A.31) factor in braces will be a function of "a" from the first factor, so the final result for (4A.31) in terms of the low-order term in $r$ will be more complex than that of (4A.17).

Therefore, the variety of possible results and greater complexity of intermediate evaluations, especially those of pattern recognition and algebraic manipulations, make this method generally unfeasible for cross-product moments. Hence, we adopt another approach, to be discussed in the main text, to evaluate the magnitude of cross-product moments.

We conclude this section by noting that evaluation of (4A.31) for $2 \le l, h \le 8$ yields that

\[ E\{ [p_i - \mu_i]^l [p_j - \mu_j]^h \mid x \} = \begin{cases} O(n^{-(l+h)/2}) & \text{for } l+h \text{ even} \\ O(n^{-(l+h+1)/2}) & \text{for } l+h \text{ odd.} \end{cases} \tag{4A.32} \]
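The orders of magnitude in (4A.32) can be illustrated numerically from the raw cross moments (4A.29). The sketch below is an assumption-laden illustration, not the MACSYMA evaluation: it holds the posterior means fixed at hypothetical values 0.1 and 0.6, computes the exact cross-product central moment at two values of $n + \sum \nu_h$ that differ by a factor of 100, and estimates the exponent of decay from their ratio.

```python
from fractions import Fraction
from math import comb, log


def rising(a, m):
    """a (a+1) ... (a+m-1), with the empty product equal to 1."""
    out = 1
    for q in range(m):
        out *= a + q
    return out


def cross_moment(l, h, a_i, a_j, N):
    """Exact E[(p_i-mu_i)^l (p_j-mu_j)^h | x] for a_i = x_i+nu_i, a_j = x_j+nu_j."""
    mu_i, mu_j = Fraction(a_i, N), Fraction(a_j, N)
    total = Fraction(0)
    for a in range(l + 1):
        for b in range(h + 1):
            # Raw moment E(p_i^a p_j^b | x) written as ratios of rising factorials.
            raw = Fraction(rising(a_i, a) * rising(a_j, b), rising(N, a + b))
            sign = (-1) ** (l - a + h - b)
            total += (sign * comb(l, a) * comb(h, b)
                      * mu_i ** (l - a) * mu_j ** (h - b) * raw)
    return total


def decay_exponent(l, h, scale=10 ** 6):
    """Estimate e in |moment| ~ n^(-e) from totals scale and 100*scale."""
    small = cross_moment(l, h, scale // 10, 6 * scale // 10, scale)
    large = cross_moment(l, h, 10 * scale, 60 * scale, 100 * scale)
    return log(float(small / large)) / log(100)


print(decay_exponent(2, 2), decay_exponent(2, 3))
```

For $l+h$ even the estimated exponent is close to $(l+h)/2$, and for $l+h$ odd it is close to $(l+h+1)/2$, in line with (4A.32).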
APPENDIX 4B

LIMITING POSTERIOR DISTRIBUTION GIVEN COMPLETE DATA 


Cox and Hinkley (1974, p399) prove in general that when the data has 
an exponential-family distribution and the conjugate prior is used, the 
limiting posterior distribution is multivariate normal with the vector of 
maximum likelihood estimates for the mean. The inverse covariance matrix 
of this limiting distribution is the negative matrix of second partial 
derivatives of the log likelihood evaluated at the maximum likelihood 
estimates. In this appendix we prove this theorem in detail for our complete- 
data case where the data has a multinomial distribution and the conjugate 
prior, the Dirichlet, is used. 

From (2.1) the posterior distribution of p given complete data is 
k-dimensional Dirichlet. Therefore, we prove that the limiting Dirichlet 
is k-dimensional multivariate normal with mean and covariance matrices those 
of the Dirichlet. We proceed by proving that the log Dirichlet converges 
to the log multivariate normal as the sample size indefinitely increases. 
Important aids will be Stirling's approximation for the logarithm of the 
gamma function and theorems from Graybill (1969) on patterned matrices. 

From (2.1) the posterior density $f(p \mid x)$ of the k-dimensional variable $p$ given complete data $x$ is, in the notation of Wilks (1963, p178), that for the Dirichlet distribution $D(x_1 + \nu_1, \ldots, x_k + \nu_k; x_{k+1} + \nu_{k+1})$; i.e.,

\[ f(p \mid x) = \Big\{ \Gamma\Big[ \sum_{h=1}^{k+1} (x_h + \nu_h) \Big] \Big/ \prod_{h=1}^{k+1} \Gamma(x_h + \nu_h) \Big\} \prod_{h=1}^{k+1} p_h^{x_h + \nu_h - 1}, \tag{4B.1} \]

where $x_h$ is the number of observations falling in category $C_h$, $\nu_h$ is the real, positive parameter for the prior density (1.1) of $p$, and $p_{k+1} = 1 - \sum_{h=1}^{k} p_h$. As the sample size $n = \sum_{h=1}^{k+1} x_h$ increases, $x_i/n$ approaches a constant and $\nu_i/n$ approaches zero.

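For the Dirichlet $D(x_1 + \nu_1, \ldots, x_k + \nu_k; x_{k+1} + \nu_{k+1})$, recall from (2.2) - (2.4) that the mean vector $\mu$ of $p \mid x$ has elements, for $1 \le i \le k$,

\[ \mu_i = (x_i + \nu_i) \Big/ \Big( n + \sum_{h=1}^{k+1} \nu_h \Big) \tag{4B.2} \]

and that the covariance matrix $\Sigma = (\sigma_{ij})$ of $p \mid x$ has elements, for $1 \le i \le k$,

\[ \sigma_{ii} = \mu_i (1 - \mu_i) \Big/ \Big( n + \sum_{h=1}^{k+1} \nu_h + 1 \Big) \tag{4B.3} \]

and, for $i < j \le k$,

\[ \sigma_{ij} = -\mu_i \mu_j \Big/ \Big( n + \sum_{h=1}^{k+1} \nu_h + 1 \Big). \tag{4B.4} \]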

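Now, $(n + \sum \nu_h + 1)\Sigma$ is of a matrix pattern treated by Graybill (1969), who gives its determinant and inverse. Applying Theorems 1.5.4 and 8.4.3 of Graybill (1969, p8, 184) to $(n + \sum \nu_h + 1)\Sigma$ yields that

\[ \det(\Sigma) = \prod_{i=1}^{k+1} \mu_i \Big/ \Big( n + \sum \nu_h + 1 \Big)^k \tag{4B.5} \]

for "det" denoting determinant. Applying Theorem 8.3.3 of Graybill (1969, p170) yields that the inverse $\Sigma^{-1} = (\sigma^{ij})$ of $\Sigma$ has, for $1 \le i \le k$, elements

\[ \sigma^{ii} = \Big( n + \sum \nu_h + 1 \Big) (\mu_i + \mu_{k+1}) / (\mu_i \mu_{k+1}) \tag{4B.6} \]

and, for $i < j \le k$,

\[ \sigma^{ij} = \Big( n + \sum \nu_h + 1 \Big) \Big/ \mu_{k+1}. \tag{4B.7} \]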
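The determinant and inverse quoted from Graybill can be confirmed directly on a small case. The sketch below is a verification aid with hypothetical values ($k = 3$, posterior means 1/5, 3/10, 1/10, 2/5, and $n + \sum \nu_h + 1 = 51$), not part of the original proof: it builds the Dirichlet covariance matrix in rational arithmetic, multiplies it by the claimed inverse, and compares the determinant with the product formula.

```python
from fractions import Fraction


def dirichlet_cov(mu, c):
    """k x k covariance matrix with c = n + sum(nu) + 1 and mu the first k means."""
    k = len(mu)
    return [[(mu[i] * (1 - mu[i]) if i == j else -mu[i] * mu[j]) / c
             for j in range(k)] for i in range(k)]


def claimed_inverse(mu, mu_last, c):
    """Elements of the inverse as stated in the text above."""
    k = len(mu)
    return [[c * (mu[i] + mu_last) / (mu[i] * mu_last) if i == j else c / mu_last
             for j in range(k)] for i in range(k)]


def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def matmul(x, y):
    return [[sum(x[i][t] * y[t][j] for t in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


mu = [Fraction(1, 5), Fraction(3, 10), Fraction(1, 10)]
mu_last = 1 - sum(mu)
c = 51
S = dirichlet_cov(mu, c)
product = matmul(S, claimed_inverse(mu, mu_last, c))
identity = [[Fraction(int(i == j)) for j in range(3)] for i in range(3)]
det_formula = mu[0] * mu[1] * mu[2] * mu_last / Fraction(c) ** 3
print(product == identity, det(S) == det_formula)
```

Both comparisons are exact equalities of rational numbers, so they confirm the formulas for this case without any tolerance.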


By dropping the term $O(q^{-1})$ in Stirling's formula [Cramer (1951, p130)]

\[ \log[\Gamma(q)] = (q - \tfrac{1}{2}) \log(q) - q + \tfrac{1}{2} \log(2\pi) + O(q^{-1}) \tag{4B.8} \]

for the logarithm of the gamma function for $q$ positive and real, we can approximate the logarithm of the Dirichlet density (4B.1) as

\[ \log[f(p \mid x)] = \Big( n + \sum \nu_h - \tfrac{1}{2} \Big) \log\Big( n + \sum \nu_h \Big) + \tfrac{1}{2} \log(2\pi) - \sum_{i=1}^{k+1} \big[ (x_i + \nu_i - \tfrac{1}{2}) \log(x_i + \nu_i) + \tfrac{1}{2} \log(2\pi) \big] + \sum_{i=1}^{k+1} (x_i + \nu_i - 1) \log(p_i) + O(n^{-1}) \]
\[ = -\sum_{i=1}^{k+1} (x_i + \nu_i) \log\Big\{ (x_i + \nu_i) \Big/ \Big[ \Big( n + \sum \nu_h \Big) p_i \Big] \Big\} + \tfrac{1}{2} \log\Big\{ \prod_{i=1}^{k+1} (x_i + \nu_i) \Big/ \Big[ \Big( n + \sum \nu_h \Big) (2\pi)^k \Big] \Big\} - \log\Big( \prod_{i=1}^{k+1} p_i \Big) + O(n^{-1}). \tag{4B.9} \]


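Now, for $1 \le i \le k+1$, let

\[ z_i = (p_i - \mu_i) / \sqrt{\sigma_{ii}} \tag{4B.10} \]

where we define $\mu_{k+1}$ and $\sigma_{k+1,k+1}$ by (4B.2) and (4B.3), respectively. Then, for $1 \le i \le k+1$,

\[ E(z_i) = 0, \tag{4B.11} \]

\[ \mathrm{var}(z_i) = 1, \tag{4B.12} \]

and, for $1 \le i \le k$, $i < j \le k$,

\[ \mathrm{cov}(z_i, z_j) = \sigma_{ij} / (\sigma_{ii} \sigma_{jj})^{1/2}. \tag{4B.13} \]

Thus, from (4B.10),

\[ p_i = \mu_i + z_i \sqrt{\sigma_{ii}} \]
\[ = \mu_i \Big( 1 + z_i \Big\{ (1 - \mu_i) \Big/ \Big[ \mu_i \Big( n + \sum \nu_h + 1 \Big) \Big] \Big\}^{1/2} \Big). \tag{4B.14} \]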
From Tchebychev's inequality [Bishop, Fienberg, and Holland (1975, p476)] and (4B.3),

\[ p_i - \mu_i = O_p(\sigma_{ii}^{1/2}) = O_p(n^{-1/2}). \tag{4B.15} \]

Thus, the term

\[ e_i = z_i \sqrt{ (1 - \mu_i) \Big/ \Big[ \mu_i \Big( n + \sum \nu_h + 1 \Big) \Big] } = (p_i - \mu_i)/\mu_i \tag{4B.16} \]

in the second line of (4B.14) is $O_p(n^{-1/2})$. Therefore, for large enough $n$, (4B.16) is bounded in absolute value by 1, so that [CRC Tables (1962, p373)],

\[ \log[1 + e_i] = e_i - e_i^2/2 + o_p(n^{-1}). \tag{4B.17} \]

Hence, from (4B.2), (4B.14), and (4B.17), we have that, for $1 \le i \le k$,

\[ \log(\mu_i / p_i) = -\log[1 + e_i] = -e_i + e_i^2/2 + o_p(n^{-1}). \tag{4B.18} \]


Therefore, substituting (4B.18) into (4B.9), we have that

\[ \log f(p \mid x) = \sum_{i=1}^{k+1} (x_i + \nu_i) [e_i - e_i^2/2] + \tfrac{1}{2} \log\Big\{ \prod_{i=1}^{k+1} (x_i + \nu_i) \Big/ \Big[ \Big( n + \sum \nu_h \Big) (2\pi)^k \Big] \Big\} - \log\Big( \prod_{i=1}^{k+1} p_i \Big) + o_p(n^{-1}). \tag{4B.19} \]
For the first of the four terms in (4B.19) we have, using (4B.3) and (4B.10), that

\[ \sum_{i=1}^{k+1} (x_i + \nu_i) e_i = \Big( n + \sum \nu_h \Big) \sum_{i=1}^{k+1} (p_i - \mu_i) = 0. \tag{4B.20} \]

For the second term in (4B.19) we have that

\[ -\sum_{i=1}^{k+1} (x_i + \nu_i) e_i^2/2 = \Big\{ -\tfrac{1}{2} \sum_{i=1}^{k+1} z_i^2 (1 - \mu_i) \Big\} \Big[ \Big( n + \sum \nu_h \Big) \Big/ \Big( n + \sum \nu_h + 1 \Big) \Big] = -\tfrac{1}{2} \sum_{i=1}^{k+1} z_i^2 (1 - \mu_i) + O_p(n^{-1}) \tag{4B.21} \]

since, from (4B.15) (or the meaning of a standardized variable), $z_i^2 = O_p(1)$ so that $O(n^{-1}) \sum z_i^2 (1 - \mu_i) = O_p(n^{-1})$.

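In (4B.21) we can write $z_{k+1}^2 (1 - \mu_{k+1})$ in terms of $z_i$ and $\mu_i$ for $1 \le i \le k$ as

\[ z_{k+1}^2 (1 - \mu_{k+1}) = \Big( n + \sum \nu_h + 1 \Big) (p_{k+1} - \mu_{k+1})^2 / \mu_{k+1} \]
\[ = \Big( n + \sum \nu_h + 1 \Big) \Big[ \sum_{i=1}^{k} (\mu_i - p_i) \Big]^2 \Big/ \mu_{k+1} \]
\[ = \Big( n + \sum \nu_h + 1 \Big) \Big[ \sum_{i=1}^{k} (\mu_i - p_i)^2 + 2 \sum_{i=1}^{k-1} \sum_{j>i}^{k} (\mu_i - p_i)(\mu_j - p_j) \Big] \Big/ \mu_{k+1} \tag{4B.22} \]
\[ = \Big\{ \sum_{i=1}^{k} z_i^2 (1 - \mu_i) \mu_i + 2 \sum_{i=1}^{k-1} \sum_{j>i}^{k} z_i z_j [\mu_i (1 - \mu_i) \mu_j (1 - \mu_j)]^{1/2} \Big\} \Big/ \mu_{k+1} \]

since

\[ \Big( n + \sum \nu_h + 1 \Big) (\mu_i - p_i)^2 = (\mu_i - p_i)^2 \Big\{ \Big( n + \sum \nu_h + 1 \Big) \Big/ [\mu_i (1 - \mu_i)] \Big\} \mu_i (1 - \mu_i) = z_i^2 \mu_i (1 - \mu_i) \tag{4B.23} \]

and, similarly, for $j \ne i$,

\[ \Big( n + \sum \nu_h + 1 \Big) (\mu_i - p_i)(\mu_j - p_j) = z_i z_j [\mu_i (1 - \mu_i) \mu_j (1 - \mu_j)]^{1/2}. \tag{4B.24} \]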

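Therefore, substituting (4B.22) into (4B.21) and recalling (4B.6) and (4B.7), we have that

\[ -\sum_{i=1}^{k+1} (x_i + \nu_i) e_i^2/2 = -\tfrac{1}{2} \Big\{ \sum_{i=1}^{k} z_i^2 (1 - \mu_i)(1 + \mu_i/\mu_{k+1}) + 2 \sum_{i=1}^{k-1} \sum_{j>i}^{k} z_i z_j [\mu_i (1 - \mu_i) \mu_j (1 - \mu_j)]^{1/2} / \mu_{k+1} \Big\} + O_p(n^{-1}) \]
\[ = -\tfrac{1}{2} \Big( n + \sum \nu_h + 1 \Big) \Big[ \sum_{i=1}^{k} (p_i - \mu_i)^2 (\mu_i + \mu_{k+1}) / (\mu_i \mu_{k+1}) + 2 \sum_{i=1}^{k-1} \sum_{j>i}^{k} (p_i - \mu_i)(p_j - \mu_j) / \mu_{k+1} \Big] + O_p(n^{-1}) \tag{4B.25} \]
\[ = -\tfrac{1}{2} (p - \mu) \Sigma^{-1} (p - \mu)' + O_p(n^{-1}). \]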

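Now, from (4B.15) we have that

\[ \prod_{i=1}^{k+1} p_i = \prod_{i=1}^{k+1} \mu_i + O_p(n^{-1/2}). \tag{4B.26} \]

Therefore, by using (4B.2), (4B.5), and (4B.26), we can write the last two terms of the log Dirichlet (4B.19) as

\[ \log\Big\{ \Big[ \prod_{i=1}^{k+1} (x_i + \nu_i) \Big]^{1/2} \Big/ \Big[ \Big( n + \sum \nu_h \Big)^{1/2} (2\pi)^{k/2} \prod_{i=1}^{k+1} p_i \Big] \Big\} \]
\[ = \log\Big\{ (2\pi)^{k/2} \Big[ \Big( n + \sum \nu_h + 1 \Big) \Big/ \Big( n + \sum \nu_h \Big) \Big]^{k/2} \Big[ \Big( \prod_{i=1}^{k+1} \mu_i \Big)^{1/2} + O_p(n^{-1/2}) \Big] \Big/ \Big( n + \sum \nu_h + 1 \Big)^{k/2} \Big\}^{-1} \]
\[ = \log\big( (2\pi)^{k/2} [1 + O(n^{-k/2})] \{ [\det(\Sigma)]^{1/2} + O_p(n^{-(k+1)/2}) \} \big)^{-1} \tag{4B.27} \]
\[ = \log\{ (2\pi)^{k/2} [\det(\Sigma)]^{1/2} [1 + O_p(n^{-1/2})] \}^{-1}. \]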
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Therefore, substituting (4B.20), (4B.25), and (4B.27) into (4B.19), 
taking the antilogarithm of the result, and noting that 

exp[O p (n -1 ) ] = l+OpCn" 1 ), (4B.28) 

we have that 

f(p|x) = {(27r) k/2 [det(E)3^1+Op(n^)]r 1 {exp[-H(p-y)r 1 (p-y) , ]}[l+O p (n" 1 )] 

(4B.29) 

= {(2Tr) k/2 [det(z)]V 1 exp[-J 5 (p-y)z" 1 (p-y)] [ l+0 p ( n"^) ] . 

Rao [1968, (xv), p104] proves that if the density of a random variable
converges to some density, then the distribution of the random variable
converges to the distribution for the limiting density. Therefore,

    \lim_{n\to\infty} D(x_1+v_1,\ldots,x_k+v_k; x_{k+1}+v_{k+1}) = N_k(\mu,\Sigma);     (4B.30)

that is, the limiting k-dimensional posterior distribution of p given
complete data x is k-dimensional multivariate normal with mean and
covariance matrices those of the Dirichlet.
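
The mean and covariance matrices of the Dirichlet that appear in this limit can be checked numerically. The following is only a minimal sketch, assuming the standard Dirichlet moments (mean a_i/m, variance mu_i(1-mu_i)/(m+1), covariance -mu_i mu_j/(m+1), with m the sum of the parameters) and the closed-form inverse (m+1)[diag(1/mu_i) + 11'/mu_{k+1}]. The function names and the parameter values a_i = x_i + v_i are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def dirichlet_moments(a):
    """Mean vector (all k+1 components) and k x k covariance matrix of the
    first k components of a Dirichlet D(a_1,...,a_k; a_{k+1})."""
    m = sum(a)
    k = len(a) - 1
    mu = [Fraction(ai, 1) / m for ai in a]
    cov = [[(mu[i] * (1 - mu[i]) if i == j else -mu[i] * mu[j]) / (m + 1)
            for j in range(k)] for i in range(k)]
    return mu, cov

def dirichlet_precision(a):
    """Closed-form inverse of the k x k covariance matrix above:
    (m+1) * [diag(1/mu_i) + 1/mu_{k+1}] elementwise."""
    m = sum(a)
    k = len(a) - 1
    mu = [Fraction(ai, 1) / m for ai in a]
    return [[(m + 1) * ((1 / mu[i] if i == j else 0) + 1 / mu[k])
             for j in range(k)] for i in range(k)]

def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Hypothetical parameters a_i = x_i + v_i for a trinomial (k = 2).
a = [12, 8, 5]
mu, cov = dirichlet_moments(a)
product = matmul(cov, dirichlet_precision(a))
identity = [[1 if i == j else 0 for j in range(2)] for i in range(2)]
```

Exact rational arithmetic is used so that the product of the covariance matrix and its claimed inverse can be compared with the identity matrix without rounding error.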


APPENDIX 4C

CENTRAL MOMENTS OF k-DIMENSIONAL MULTIVARIATE NORMAL DISTRIBUTION 
4C.1 Introduction: 


Let $x_1,\ldots,x_k$ have the k-dimensional multivariate normal distribution
$N_k(\mu,\Sigma)$ with the 1×k mean vector $\mu$ and k×k covariance matrix
$\Sigma=(\sigma_{ij})$. Anderson (1958, p39) gives the second and fourth central
moments of this distribution. Lindley (1965, v1, p95) and Schmetterer
(1974, p76) give the $\ell$th central moment for the one-dimensional
distribution. In this appendix we derive the general central moment of
the k-dimensional distribution. We conclude the appendix by illustrating
the formula for the first six central cross-product moments and by
showing that it equals the formulas of Anderson, Lindley, and Schmetterer
in their specialized cases.

Because the moment-generating function of the multivariate normal
distribution exists, we work with it rather than the characteristic
function to avoid using the extra, complex, variable $\sqrt{-1}$. To obtain
central moments, we multiply the moment-generating function
$\phi(t_1,\ldots,t_k)$ by $\exp(-t\mu')$, repeatedly differentiate the result
with respect to t, and then set t to 0 in the differentiated result.
[See Lindley (1965, v1, p92) or Jeffreys (1939, p74).]

An alternative approach is to calculate the cumulants $\kappa_{i_1 \ldots i_k}$ and then,
from the cumulants, the central moments. Straightforward calculation yields
the results of Anderson (1958, p39) that $\kappa_{0 \ldots 0 1 0 \ldots 0} = \mu_j$ (the 1 in
position j) for $1 \le j \le k$,

    \kappa_{0 \ldots 0 i_j 0 \ldots 0 i_l 0 \ldots 0} = \sigma_{jl}

(ones in positions j and l, or a single 2 when j = l) for $1 \le j,l \le k$, and

    \kappa_{i_1 \ldots i_k} = 0   for \sum_{j=1}^{k} i_j > 2.

From Kendall and Stuart (1969, v1, p70) we can therefore write the first
ten central moments $E[(p_i-\mu_i)^\ell | x]$ for $1 \le \ell \le 10$. The method to extend
these results to the general central (cross-product) moment, however, is
no briefer than the method using moment-generating functions that is
given in the next section.

4C.2 Theory


From the characteristic function given by Anderson (1958, p36) and
Wilks (1963, p168), we can write the moment-generating function
$\phi(t_1,\ldots,t_k)$ of the k-dimensional multivariate normal distribution as

    \phi(t_1,\ldots,t_k) = \exp(t\mu' + \tfrac{1}{2} t\Sigma t').                 (4C.1)

Defining

    f(t_1,\ldots,t_k) = \exp(-t\mu') \phi(t_1,\ldots,t_k),                        (4C.2)

we have that

    f(t_1,\ldots,t_k) = \exp(\tfrac{1}{2} t\Sigma t')
        = \exp[\tfrac{1}{2}(\sum_{i=1}^{k} t_i^2 \sigma_{ii} + 2\sum_{i=1}^{k-1}\sum_{j>i} t_i t_j \sigma_{ij})].     (4C.3)

Hence,

    \partial f(t_1,\ldots,t_k)/\partial t_i
        = (t_i \sigma_{ii} + \sum_{j \ne i} t_j \sigma_{ij}) f(t_1,\ldots,t_k).          (4C.4)

Define

    C_i = t_i \sigma_{ii} + \sum_{j \ne i} t_j \sigma_{ij}                           (4C.5)

and rewrite (4C.4) as

    \partial f(t_1,\ldots,t_k)/\partial t_i = C_i f(t_1,\ldots,t_k).              (4C.6)

Now, for all $1 \le i,j \le k$,

    \partial C_i/\partial t_j = \sigma_{ij}.                                     (4C.7)

Then,

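    \partial^2 f/(\partial t_i \partial t_j) = \sigma_{ij} f + C_i \partial f/\partial t_j,           (4C.8)

    \partial^3 f/(\partial t_i \partial t_j \partial t_k)
        = \sigma_{ij} \partial f/\partial t_k + \sigma_{ik} \partial f/\partial t_j
          + C_i \partial^2 f/(\partial t_j \partial t_k),                              (4C.9)

and

    \partial^4 f/(\partial t_i \partial t_j \partial t_k \partial t_\ell)
        = \sigma_{ij} \partial^2 f/(\partial t_k \partial t_\ell)
          + \sigma_{ik} \partial^2 f/(\partial t_j \partial t_\ell)
          + \sigma_{i\ell} \partial^2 f/(\partial t_j \partial t_k)
          + C_i \partial^3 f/(\partial t_j \partial t_k \partial t_\ell),                 (4C.10)

where f abbreviates $f(t_1,\ldots,t_k)$.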


Continuing in this fashion, we have in general that, for $1 \le h_g \le k$
and $\ell$ a positive integer,

    \partial^\ell f/(\prod_{g=1}^{\ell} \partial t_{h_g})
        = \sum_{i=2}^{\ell} \sigma_{h_1 h_i} \partial^{\ell-2} f/(\prod_{m=2, m \ne i}^{\ell} \partial t_{h_m})
          + C_{h_1} \partial^{\ell-1} f/(\prod_{m=2}^{\ell} \partial t_{h_m}).            (4C.11)

We use the double subscript $h_g$ rather than a single subscript h
because $t_h$ is meaningless for $k < h \le \ell$ and we want a convenient way of
allowing all possible permutations of the k integers and their
powers j for $1 \le j \le \ell$.

Now, odd central moments are 0 because the multivariate normal
distribution is symmetric about the mean.¹ For $\ell$ even and

    y_{h_g} = x_{h_g} - \mu_{h_g},                                         (4C.12)

we therefore have from (4C.11), evaluated at t=0 where $C_{h_1}=0$, that
the $\ell$th central moment $E(\prod_{g=1}^{\ell} y_{h_g})$ is

    E(\prod_{g=1}^{\ell} y_{h_g})
        = \sum_{i=2}^{\ell} \sigma_{h_1 h_i} E(\prod_{m=2, m \ne i}^{\ell} y_{h_m}).        (4C.13)

Therefore, each of the $\ell-1$ terms in the $\ell$th central moment is
a variance or covariance times an $(\ell-2)$nd central moment. Evaluating
the second central moment (4C.8) at t=0 yields that

    E(\prod_{g=1}^{2} y_{h_g}) = \sigma_{h_1 h_2}.                             (4C.14)

Hence, by induction the $\ell$th central moment $E(\prod_{g=1}^{\ell} y_{h_g})$ is a sum of
products, $(\ell-1)(\ell-3)\cdots 3 \cdot 1$ in all, each of which is a product of those
$\ell/2$ elements of the covariance matrix that are indexed by the
subscripts $h_g$. That is, for


(i) ■ 1. 


(4C.15) 


and 


/ 2j-l _ cj-i C J~ C _ \ 

(ii) i,. . = min 2x n 6. «,3x n 8. . ,(2j-2)x n 8. ? . ?,2j-l)j, 

' b=2 V* b=2 V d b=2 V ZJ ~ Z 1 

(4C.16) 


where 2-j-£/2 and 


¹ Also because $C_i(0) = 0$, $f'(0) = 0$ and, by induction, all odd moments
are zero.



6. =1-5. 

V q v q 



(4C.17) 


is defined to be the one complement of the Kronecker Delta symbol 


6 . 


'b* 1 


[see Feller (1968, v1, p428), Korn and Korn (1968, p544), or CRC
Standard Math Tables (1962, p501)], we have that
£ £ 
g=i g 


A 4 Av 

E( n Yu ) = £ a. h 2 a. . 

9=1 h g io=2 h i, h io U>u h iJV i 


2 '4" '3 

i 4^ i j 
for j<4 


'3 '4 


n . h. 


£-4 >1 £-3 V3 1 £-4 

ws- 

for j<£-3 


1 

Z 


£ 

Z 


z 

i«>i 


r ‘£-i 

for j<£-l 


i h i 

'£-1 h 


Q(3) 

£ 

Q(5) 

£ 

i 

Z 

Z 

Z 

V 2 

1 4 >i 3 

i 5 =3 

1 6 >i 5 

V’j 

i 4^ 1 j 



for j<3 

for j<4 

for j<5 

for j<6 


(4C.18) 


Q(2s-1) 

I 

i 2 s-i= s 
^s-l^'j 


£ 

Z 

1 2s >1 2s-l 
i 2s^ i j 


for j<2s-l for j<2s 


QU-1) 

i £-l = £/2 


£ 

Z 

i„>i, 


£' '£-1 
£^J 

for j<£-l for j<£ 


Vi. 


h i h i 
] 3 U 


• • • 0 


h_- h. ' 
2s-l n 2s 


i £-l^ i j 


• a. 


where
2s-2 2s-2 2s-2 

Q(2s-1) = s + Z 6. + Z Z 6. 6. . 

j=2 1 j * s a=2 b=2 ’a* 5 V s 1 

b^a 


2s-2 2s-2 
+...+ E Z 


2s-2 2s-2 

E 


n 6 . . , 

V 2 W 2 W 2 b=3 V 

W j s ^2s-2^ a 

for a<2s-2 

for 6. _ the Kronecker Delta symbol. For example, 

I « j 5 


and 


Q(3) =2+6. ? 

1 2 

4 4 4 

Q(5) = 3 + E 6. - + E E 6. , 6. .. 

j=2 'j- 3 j 3 =2 j 4 =2 1 jj- 5 

V j 3 


(4C.19) 

4C.3 Illustrations

From (4C.18) the first six central moments of the k-dimensional
multivariate normal distribution $N_k(\mu,\Sigma=(\sigma_{ij}))$ are, for
$1 \le a,b,c,d,e,f \le k$,

    E(y_a) = E(y_a y_b y_c) = E(\prod_{g=1}^{5} y_{h_g}) = 0,               (4C.20)

    E(y_a y_b) = \sigma_{ab},                                              (4C.21)

    E(y_a y_b y_c y_d) = \sigma_{ab}\sigma_{cd} + \sigma_{ac}\sigma_{bd} + \sigma_{ad}\sigma_{bc},       (4C.22)

and

    E(y_a y_b y_c y_d y_e y_f)
        = \sigma_{ab}[\sigma_{cd}\sigma_{ef} + \sigma_{ce}\sigma_{df} + \sigma_{cf}\sigma_{de}]
        + \sigma_{ac}[\sigma_{bd}\sigma_{ef} + \sigma_{be}\sigma_{df} + \sigma_{bf}\sigma_{de}]
        + \sigma_{ad}[\sigma_{bc}\sigma_{ef} + \sigma_{be}\sigma_{cf} + \sigma_{bf}\sigma_{ce}]
        + \sigma_{ae}[\sigma_{bc}\sigma_{df} + \sigma_{bd}\sigma_{cf} + \sigma_{bf}\sigma_{cd}]
        + \sigma_{af}[\sigma_{bc}\sigma_{de} + \sigma_{bd}\sigma_{ce} + \sigma_{be}\sigma_{cd}].       (4C.23)

Thus, for $N_3(\mu,\Sigma)$,

    E(y_1^2 y_2 y_3) = \sigma_{11}\sigma_{23} + 2\sigma_{12}\sigma_{13}                     (4C.24)

and

    E(y_1^6) = 15\sigma_{11}^3.                                            (4C.25)

We note that the second and fourth moments agree with Anderson
(1958, p39) and that the $\ell$th central moment for $\ell$ even is

    E(y_i^\ell) = [\prod_{j=1}^{\ell/2} (2j-1)] \sigma_{ii}^{\ell/2}
                = \sigma_{ii}^{\ell/2} \ell!/[2^{\ell/2} (\ell/2)!],                   (4C.26)

paralleling results for the one-dimensional normal distribution from
Lindley (1965, v1, p95) and Schmetterer (1974, p76).
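
The recursion (4C.13) lends itself to a direct numerical check of these illustrations. The following is a minimal sketch, assuming only the recursion itself: the first subscript is paired with each remaining subscript in turn, and the moment of the leftover subscripts is computed recursively. The function name `central_moment` and the covariance matrix `sigma` are hypothetical, chosen for illustration.

```python
from math import factorial

def central_moment(idx, sigma):
    """E[prod_g y_{h_g}] for a zero-mean normal vector y with covariance
    matrix sigma; idx is a tuple of 0-based subscripts (h_1, ..., h_l)."""
    if len(idx) == 0:
        return 1          # empty product
    if len(idx) % 2 == 1:
        return 0          # odd central moments vanish
    first, rest = idx[0], idx[1:]
    total = 0
    for i, h in enumerate(rest):
        # Pair h_1 with h_i and recurse on the remaining subscripts, as in (4C.13).
        remaining = rest[:i] + rest[i + 1:]
        total += sigma[first][h] * central_moment(remaining, sigma)
    return total

# Hypothetical symmetric, positive definite covariance matrix for N_3.
sigma = [[2, 1, 0],
         [1, 3, 1],
         [0, 1, 4]]

fourth = central_moment((0, 0, 1, 2), sigma)   # E(y_1^2 y_2 y_3), cf. (4C.24)
sixth = central_moment((0,) * 6, sigma)        # E(y_1^6), cf. (4C.25)
coefficient = factorial(6) // (2 ** 3 * factorial(3))   # l!/[2^(l/2) (l/2)!] for l = 6
```

The recursion visits every pairing of the subscripts, so its cost grows quickly with the order of the moment; it is meant as a check on small cases, not as an efficient algorithm.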
APPENDIX 4D

LIMITING POSTERIOR DISTRIBUTION GIVEN INCOMPLETE DATA
4D.1 Introduction:

In this appendix we calculate the limiting k-dimensional posterior 
distribution of p given incomplete data z. We calculate this limiting 
distribution in two ways. In the traditional approach, given in Section 
4D.2, we note that the prior density is continuous in p and the likelihood 
is regular. Therefore, the limiting distribution for the posterior density 
f(p|z) is multivariate normal with the vector of maximum likelihood esti-
mates for the mean. The negative matrix of second partial derivatives of
the log likelihood, evaluated at the maximum likelihood estimate, is the 
large-sample inverse covariance matrix. Therefore, rewriting the log 
likelihood in terms of exponential parameters, using theory from Sundberg 
(1974) to calculate the first and second partial derivatives of the log 
likelihood with respect to these exponential parameters, and transforming 
results back to p gives elements of the asymptotic posterior mean and 
covariance matrices. However, results for the asymptotic inverse covariance 
matrix are very long and complicated expressions that do not easily simplify. 

Therefore, to obtain simpler expressions and, moreover, equations 
paralleling those for complete data, in Section 4D.3 we also derive the
limiting posterior distribution another way. We rewrite the posterior
density as a product of complete-data Dirichlet densities, each having,
from Appendix 4B, a limiting multivariate normal distribution. Because
these densities are of differing dimensions and on differing combinations
of variables, we do not immediately have that the resultant product of
these multivariate normal densities is a k-dimensional multivariate normal
density on the k components of p. However, by equating coefficients and
solving for unknowns, we then prove that, owing to the special relation- 
ship between the first and each remaining product, the sum of exponents 
from each Dirichlet in the product does form the exponent of such a density. 

As part of this proof, we check that the k-dimensional inverse matrix 
in the exponent is positive definite and symmetric; hence, a covariance 
matrix. We also obtain the nonexponential term for the limiting multi- 
variate normal distribution and prove that the limit of the denominator of 
the posterior density (the marginal distribution) is 1. The essential step 
for the latter is that the limit of the integral that is this denominator 
can be taken inside the integral. 

We begin this nontraditional approach by first considering, in Section
4D.3.1, the case having at least one category, say $C_{k+1}$, for which all data
is complete. For this case we derive elements of the asymptotic mean and 
inverse covariance matrices as functions of a number of unknown ratios and 
as a large nonlinear system of equations. In Subsection 4D.3.2 we rewrite 
the moments to eliminate these ratios and reduce the nonlinear system of 
equations. As results, we get the maximum likelihood estimate for the 
asymptotic mean and very simple expressions for the asymptotic covariance 
matrix that parallel expressions from the complete-data case. In Subsection 
4D.3.3 we extend proofs from the preceding two subsections to the general 
case allowing incomplete data on all categories. The expression for elements 
of the asymptotic mean vector is identical to that of Subsection 4D.3.2. 
Expressions for elements of the asymptotic inverse covariance matrix are 
those of 4D.3.2 plus terms for those patterns of incomplete data that index 
the dependent variable. 

In Section 4D.4 we simplify results given by the traditional approach
in Section 4D.2 for the inverse covariance matrix. Note that it is only
by knowing results of Section 4D.3 and by using much algebraic manipulation
that we can simplify these equations to those given by the nontraditional 
approach. The algebraic manipulation is so extensive that numerous human 
errors occur. Hence, knowing the final result at which to aim is critical. 
It allows continual checking and correcting of various parts of the 
equations. 

Therefore, the nontraditional approach will be useful in other kinds 
of problems when the traditional approach gives unwieldy results. Because 
we are piecing together densities of different dimensions and on different 
combinations of variables, the notation for the nontraditional method in 
Section 4D.3 is necessarily complicated. However, for most types of pro- 
blems, notational difficulties would not exist. 

Section 4D.5 concludes the appendix with three examples allowing, 
with the help of the MACSYMA symbolic computer system, exact solution of 
the nonlinear system of equations for the asymptotic mean. We also give 
numerical illustrations for the asymptotic mean, inverse covariance, and 
covariance matrices. Note that the exact solutions can be used for the 
Taylor-series approximate posterior mean and the posterior mode, as well 
as the asymptotic mean, which is the maximum likelihood estimate. Because 
an exact solution exists only in very special cases and is expensive, 
however, it cannot generally be used. A general method to solve the non-
linear system of equations is the EM iterative algorithm of Dempster, Laird,
and Rubin (1977) discussed in Section 2.3.2.

4D.2 Traditional Approach:

The prior density g(p) in (1.1) is continuous in p and the likelihood
h(z|p) in (2.8) is regular. Therefore [Cox and Hinkley (1974, p401)], the
limiting distribution for the posterior density f(p|z) is multivariate
normal with the vector $\hat{p}$, for

    \hat{p}_i = (z_i + \sum_{D \ni i} z_D \hat{p}_i/\hat{p}_D)/n                       (4D.1)

from (2.36), of maximum likelihood estimates for the mean u and the matrix

    \{-\partial^2 \log[h(z|p)]/(\partial p \partial p')\}_{p=\hat{p}}                      (4D.2)

for the inverse covariance matrix.
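
Equation (4D.1) is a fixed-point condition and can be solved by simple iteration. The following is a minimal sketch of such an iteration in the spirit of the EM algorithm, assuming the usual update for partially classified multinomial counts: each partially classified count z_D is allocated to the categories in D in proportion to the current estimates. The function name, the data layout (0-based category indices, patterns stored as tuples), and the counts are all hypothetical.

```python
def em_partially_classified(complete, partial, tol=1e-12, max_iter=10_000):
    """Iterate p_i <- (z_i + sum_{D containing i} z_D * p_i / p_D) / n.

    complete: list of fully classified counts z_i, one per category.
    partial:  dict mapping a tuple of category indices D to the count z_D.
    """
    n = sum(complete) + sum(partial.values())
    k1 = len(complete)
    p = [1.0 / k1] * k1                      # start at the center of the simplex
    for _ in range(max_iter):
        new_p = []
        for i in range(k1):
            expected = float(complete[i])
            for D, z_D in partial.items():
                if i in D:
                    p_D = sum(p[j] for j in D)
                    expected += z_D * p[i] / p_D   # expected share of z_D in category i
            new_p.append(expected / n)
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical trinomial data z = (z_1, z_2, z_3, z_12, z_13).
complete = [10, 6, 4]
partial = {(0, 1): 3, (0, 2): 2}
p_hat = em_partially_classified(complete, partial)
```

Each update preserves the constraint that the estimates sum to one, because every z_D is fully distributed among the categories in D.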

Recall that the multinomial density is a member of the exponential
family, where we define the exponential-family parameters

    \phi_i = \log(p_i/p_{k+1}).                                          (4D.3)

As in Section 2.3.1, let

    t_i(x) = z_i + \sum_{D \ni i} z_D^{(i)}                                  (4D.4)

for $z_D^{(i)}$ the (unknown) number of the $z_D$ observations that fall in category
$C_i$. Then, as noted in Section 2.3.2, Sundberg (1974) proves that

    \partial^2 \log[h(z|\phi)]/(\partial\phi \partial\phi') = -cov[t(x)|\phi] + cov[t(x)|z,\phi],      (4D.5)

where $h(z|\phi)$ is the likelihood h(z|p) written in terms of $\phi$ instead of p.

Since the first partial derivatives are zero at the maximum likelihood
estimate, application of the chain rule to the negative of (4D.5) with
evaluation at $p=\hat{p}=u$ yields

    -\partial^2 \log h/(\partial p_i \partial p_j)|_{p=u}
        = -\sum_{a=1}^{k} (\partial\phi_a/\partial p_i)
          [\sum_{b=1}^{k} (\partial^2 \log h/(\partial\phi_a \partial\phi_b)) (\partial\phi_b/\partial p_j)]|_{p=u}
                                                                      (4D.6)
        = \sum_{a=1}^{k} (\partial\phi_a/\partial p_i)
          \sum_{b=1}^{k} [cov(t_a,t_b|\phi) - cov(t_a,t_b|z,\phi)] (\partial\phi_b/\partial p_j)|_{p=u}.

Now,

    cov(t_a,t_b|\phi)|_{p=u} = n u_a (1-u_a)     for b = a,
                             = -n u_a u_b        for b \ne a,              (4D.7)

and

    cov(t_a,t_b|z,\phi)|_{p=u} = \sum_{D \ni a} z_D u_a (u_D-u_a)/u_D^2     for b = a,
                               = -\sum_{D \ni a,b} z_D u_a u_b/u_D^2      for b \ne a.     (4D.8)


From (4D.3),

    \partial\phi_i/\partial p_j|_{p=u} = (u_i+u_{k+1})/(u_i u_{k+1})     for j = i,
                                       = 1/u_{k+1}                     for j \ne i.    (4D.9)


Applying Theorem 8.3.3 of Graybill (1969, p170) to (4D.7) yields that

    \sigma_{(1)}^{ij} = (u_i+u_{k+1})/(n u_i u_{k+1})     for j = i,
                      = 1/(n u_{k+1})                     for j \ne i           (4D.10)

for the i,jth element of $\{[cov(t|\phi)]_{p=u}\}^{-1}$. Hence, we note from
(4D.9) that, for all $1 \le i,j \le k$,

    \partial\phi_i/\partial p_j|_{p=u} = n \sigma_{(1)}^{ij}.                         (4D.11)

Substituting from (4D.7) - (4D.9) and (4D.10) - (4D.11) into (4D.6)
and writing $\partial^2 \log[h(z|\phi)]/(\partial u \partial u')$ to mean
$(\partial^2 \log[h(z|\phi)]/(\partial p \partial p'))_{p=u}$ yields that

-3 2 log[h(z|^)]/(3u.3u.) = (u.+u k+1 ) 2 [ n (l-u i )- z . ( z d /u D 2 ^ V^^Vk+l^ 

D^i 

k k 

+2 ( u i + Vl ) j i ! - nu a + 0 J j!a z oV u D 2 ^Vl 2+ ^ 1 u a [,,(1 - u a ) - D J a (z O /u D 2 ) 


^ (u 0 -a )+ b 5 i>a ( -"V4 ib z D u b^D 2, J / Vl 2 


= (^ + Vl ) 2 [n( 1 " U i ) ’ D Z i Z D ( V U i )/u D 2 ]/{u i U k+l 2) (4D,12) 

k 

+2 ( u i +u w ) '- n(1 - u r u w ) * a y D J i>a u aV u D 2 >i / Vi 2 


+{n(1 - u 1-Vl ) - a 2 j u a [ D 2 a Z D ( V u a ,1/u o‘ 


-n( 1 -«r“kn ,2+ J j u a [ J jia ( D ^ jb z D U b /u 0 2 ) 1} /Vl 2 ' 
Similarly, for the i,jth element, $j \ne i$, of the asymptotic inverse covar-
iance matrix, we have that


■3 2 log[h(zU)]/(9u.3u j .) = (u.+u k+1 ){[n{l-u.}-^z D {u D -u.)/u D 2 ] 


+( Wl ,[ ’ n+ S V u 0 2]+ E U b [_n+ r Z D /u D 2]}/ Vl 2 

J k 1 D3i,j U U bjh‘,j D D3i,b U U k+i 


2 *■ 2 
+u {u [-n+ z z /u ]+ Z u,[-n+ I z / u ] 

3 1 D3i,j D 0 b^ij b Dsj.b D u 


(4D.13) 




+ l U {U [-n+ Z Z n /u 2 3+(u.+u. ^[-0+ Z z /u 2 ] 
a*i,j a 1 Dai, a D 0 J k+1 DaaJ ” u 


+ Z u [-n+ Z z /u 2 ]+[n(l-u }- £ z ( V u a )/z 0 2]}/u k+l 2 ' 
b^i,j b Daa.b 0 D a D3a u u a u k+1 

b^a • 
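
Because the expressions above are long, a numerical cross-check of the inverse covariance matrix can be useful. The following is a minimal sketch that does not use the Sundberg-based derivation of this section. Instead it assumes direct differentiation of the log likelihood kernel, sum over cells B of z_B log p_B with p_{k+1} = 1 - p_1 - ... - p_k, which gives the negative second derivative as the sum over B of z_B g_B[i] g_B[j] / p_B^2, where g_B[i] is 1 if i is in B, minus 1 if k+1 is in B. It then compares that formula with central finite differences. All names, the cell layout (0-based indices, last category as reference), and the counts are hypothetical.

```python
from math import log

def loglik(p, cells):
    """Log likelihood kernel; p holds the first k probabilities."""
    full = list(p) + [1.0 - sum(p)]
    return sum(z * log(sum(full[c] for c in B)) for B, z in cells)

def neg_hessian(p, cells):
    """Analytic negative Hessian of loglik with respect to p_1,...,p_k."""
    k = len(p)
    full = list(p) + [1.0 - sum(p)]
    H = [[0.0] * k for _ in range(k)]
    for B, z in cells:
        p_B = sum(full[c] for c in B)
        # Derivative of p_B with respect to p_i, allowing for p_{k+1} = 1 - sum(p).
        g = [(1.0 if i in B else 0.0) - (1.0 if k in B else 0.0) for i in range(k)]
        for i in range(k):
            for j in range(k):
                H[i][j] += z * g[i] * g[j] / p_B ** 2
    return H

def numeric_neg_hessian(p, cells, h=1e-4):
    """Central finite-difference approximation to the same matrix."""
    k = len(p)
    H = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            def shifted(si, sj):
                q = list(p)
                q[i] += si * h
                q[j] += sj * h
                return loglik(q, cells)
            H[i][j] = -(shifted(1, 1) - shifted(1, -1)
                        - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H

# Hypothetical trinomial cells: three complete cells and patterns {1,2}, {1,3}.
cells = [({0}, 10), ({1}, 6), ({2}, 4), ({0, 1}, 3), ({0, 2}, 2)]
p = [0.5, 0.3]
H_exact = neg_hessian(p, cells)
H_num = numeric_neg_hessian(p, cells)
```

The comparison can be made at any interior point p; evaluating at the maximum likelihood estimate gives the matrix that plays the role of the large-sample inverse covariance matrix.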
4D.3 Nontraditional Approach:
4D.3.1 Theory

From (2.9) the k-dimensional posterior density of p given incomplete
data z is

    f(p|z) = g(p)h(z|p) / \int_{P_k} g(p)h(z|p) dp                        (4D.14)

for the Dirichlet prior (1.1)

    g(p) = [\Gamma(\sum_{i=1}^{k+1} v_i)/\prod_{i=1}^{k+1} \Gamma(v_i)] \prod_{i=1}^{k+1} p_i^{v_i-1}

and the likelihood (2.8)

    h(z|p) = \prod_{P} \{[(\sum_{\mathcal{B} \in P} z_{\mathcal{B},P})!/\prod_{\mathcal{B} \in P} z_{\mathcal{B},P}!]
             \prod_{\mathcal{B} \in P} p_{\mathcal{B}}^{z_{\mathcal{B},P}}\}.



Here p takes values in the k-dimensional probability simplex
$P_k = \{(p_1,\ldots,p_{k+1}) : p_i \ge 0, \sum_{i=1}^{k+1} p_i = 1\}$; $v_i > 0$; $\mathcal{B}$ is a
nonempty subset of $\{1,2,\ldots,k+1\}$; P is a set of mutually exclusive and
exhaustive subsets $\mathcal{B}$; $\mathcal{B},P$ is the set element $\mathcal{B}$ in the set P;
$B_{\mathcal{B},P}$ is the number of elements in $\mathcal{B},P$; $z_{\mathcal{B},P}$ is the number of
observations such that each observation falls in one of the $B_{\mathcal{B},P}$
categories $C_i$ for $i \in \mathcal{B}$, but is not further classified into a particular
one of these $B_{\mathcal{B},P}$ categories if $B_{\mathcal{B},P} > 1$; and z is the vector of
$z_{\mathcal{B}} = \sum_{P} z_{\mathcal{B},P}$.

Since $\sum_{P} z_{\mathcal{B},P} = z_{\mathcal{B}}$ and we can cancel from the numerator and denominator of
the posterior density (4D.14) any terms that are not a function of p, we can
write the posterior density (4D.14) as

    f(p|z) = \prod_{i=1}^{k+1} p_i^{v_i-1} \prod_{\mathcal{B}} p_{\mathcal{B}}^{z_{\mathcal{B}}}
             / \int_{P_k} [\prod_{i=1}^{k+1} p_i^{v_i-1} \prod_{\mathcal{B}} p_{\mathcal{B}}^{z_{\mathcal{B}}}] dp.      (4D.15)
In (4D.15) the second product is over all distinct $\mathcal{B}$. Thus, the product
is over k+1 sets $\mathcal{B}_i=\{i\}$, $1 \le i \le k+1$, containing one element and $\kappa$ sets
$\mathcal{B}_{k+1+l}$, $1 \le l \le \kappa$, containing more than one element. Each of the $\kappa$
latter sets corresponds to a different pattern of incomplete data. From
these $k+1+\kappa$ sets, we will make $\kappa+1$ Dirichlet distributions D(l) for
$1 \le l \le \kappa+1$.

To do so, reorder the terms in (4D.15) so that the first k+1 multi-
plicands are those terms $p_i^{z_{\{i\}}+v_i-1}$, $1 \le i \le k+1$, for which $\mathcal{B}=\{i\}$
contains only one element. Denote the remaining $\kappa$ sets $\mathcal{B}$, those sets
containing more than one element and indexing a unique pattern of
incomplete data, as Q(l,1) for $2 \le l \le \kappa+1$. For $2 \le l \le \kappa+1$, multiply
$p_{Q(l,1)}^{z_{Q(l,1)}}$ by $p_i^{r_{il} z_{\{i\}}}$ for all $i \notin Q(l,1)$, where the ratio
$0 \le r_{il} < 1$, $\sum_{l=2}^{\kappa+1} r_{il} < 1$, is to be determined.
Define Q(l,2),...,Q(l,k-q(l)+2) as the k-q(l)+1 sets indexing the $r_{il} z_{\{i\}}$,
where q(l) is the number of categories among which $z_{Q(l,1)}$ is shared.
Define Q(l) as the set of these k+2-q(l) mutually exclusive and exhaustive
subsets Q(l,j) for $1 \le j \le k+2-q(l)$.

For example, for $z=(z_1,z_2,z_3,z_{12},z_{13})$ we have that $k=\kappa=2$,
q(2)=q(3)=2, Q(2,1)={1,2}, Q(2,2)={3}, Q(3,1)={1,3}, Q(3,2)={2},
Q(2)={{1,2},{3}}, and Q(3)={{1,3},{2}}.
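
The bookkeeping in this example can be reproduced mechanically. The following is a minimal sketch, assuming only the definitions just given: for each pattern of incomplete data, Q(l,1) is the pattern itself and the remaining Q(l,j) are the singleton sets of the categories outside it. The function name `build_Q` and the representation of sets as sorted lists are hypothetical.

```python
def build_Q(k, patterns):
    """Return {l: [Q(l,1), Q(l,2), ...]} for l = 2, ..., kappa+1.

    k:        number of free categories (there are k+1 categories in all).
    patterns: list of sets of 1-based category labels, one per pattern
              of incomplete data.
    """
    categories = range(1, k + 2)
    Q = {}
    for l, pattern in enumerate(patterns, start=2):
        first = sorted(pattern)                                   # Q(l,1)
        singletons = [[c] for c in categories if c not in pattern]  # Q(l,2), ...
        Q[l] = [first] + singletons
    return Q

# The example from the text: z = (z_1, z_2, z_3, z_12, z_13), so k = 2.
k = 2
patterns = [{1, 2}, {1, 3}]
Q = build_Q(k, patterns)
q = {l: len(sets[0]) for l, sets in Q.items()}   # q(l) = size of Q(l,1)
```

Each Q(l) built this way has k+2-q(l) members, which matches the count stated above.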

Now, for $2 \le l \le \kappa+1$, we have multiplied $p_{Q(l,1)}^{z_{Q(l,1)}}$ in (4D.15) by
$p_i^{r_{il} z_{\{i\}}}$ for $i \notin Q(l,1)$. Accordingly, for each $i \notin Q(l,1)$ multiply
$p_i^{z_{\{i\}}+v_i-1}$ in the product of the first k+1 terms by $p_i^{-r_{il} z_{\{i\}}}$.
This process yields the posterior density (4D.15) as

    f(p|z) = \{\prod_{i=1}^{k+1} p_i^{(1-\sum_{l=2}^{\kappa+1} r_{il}) z_{\{i\}} + v_i - 1}\}
             \times \prod_{l=2}^{\kappa+1} [p_{Q(l,1)}^{z_{Q(l,1)}} \prod_{i \notin Q(l,1)} p_i^{r_{il} z_{\{i\}}}]
             / \int_{P_k} (numerator) dp.                                 (4D.16)
Multiplying and dividing (4D.16) by

    \Gamma\{\sum_{i=1}^{k+1} [(1-\sum_{l=2}^{\kappa+1} r_{il}) z_{\{i\}} + v_i]\}
        / \{\prod_{i=1}^{k+1} \Gamma[(1-\sum_{l=2}^{\kappa+1} r_{il}) z_{\{i\}} + v_i]\}           (4D.17)

for the first set of multiplicands in (4D.16) and, for $2 \le l \le \kappa+1$, by

    \Gamma[z_{Q(l,1)} + \sum_{i \notin Q(l,1)} r_{il} z_{\{i\}} + \rho(l) + 1]
        / [\Gamma(z_{Q(l,1)}+1) \prod_{i \notin Q(l,1)} \Gamma(r_{il} z_{\{i\}}+1)]               (4D.18)

for each of the remaining $\kappa$ sets of multiplicands yields the numerator and
integrand of the denominator of (4D.16) as a product of $\kappa+1$ Dirichlet
densities, where the lth density has dimension

    \rho(l) = k - q(l) + 1                                               (4D.19)

for q(l) again the number of elements in Q(l,1); that is, for num denoting
numerator, $j_s \notin Q(l,1)$ for $1 \le s \le k-q(l)$, and, paralleling notation from Wilks
(1963, p178), $d(x_1,\ldots,x_k; x_{k+1})$ denoting the k-dimensional density of the
Dirichlet distribution $D(x_1,\ldots,x_k; x_{k+1})$,

    num[f(p|z)] = d[(1-\sum_{l=2}^{\kappa+1} r_{1l}) z_{\{1\}} + v_1, \ldots,
                    (1-\sum_{l=2}^{\kappa+1} r_{kl}) z_{\{k\}} + v_k;
                    (1-\sum_{l=2}^{\kappa+1} r_{k+1,l}) z_{\{k+1\}} + v_{k+1}]
                  \times \prod_{l=2}^{\kappa+1} d[z_{Q(l,1)}+1, r_{j_1 l} z_{\{j_1\}}+1, \ldots,
                    r_{j_{k-q(l)} l} z_{\{j_{k-q(l)}\}}+1; r_{k+1,l} z_{\{k+1\}}+1].       (4D.20)


In (4D.20) we assume that there is at least one category, say $C_{k+1}$, on
which all data is complete. The completely generalized case of at least
some data being incomplete on all k+1 categories is more complicated.
Therefore, the general case is deferred until Section 4D.3.3, where we
outline the basic conversion of that case to the one of this section and give
results for the mean and covariance matrices. For shorthand, we refer to
the $\kappa+1$ densities on the right-hand side of (4D.20) as simply d(1), d(2),
...,d($\kappa$+1), respectively. Note that $r_{il}$ in (4D.20) is zero for most
$1 \le i \le k+1$ and $2 \le l \le \kappa+1$.


Since the limit of a product is the product of the limits of the mul-
tiplicands and (Appendix 4B) the limiting Dirichlet distribution is multi-
variate normal, the limit of the numerator (4D.20) of the posterior distri-
bution is a product of multivariate normal distributions. That is,

    \lim_{n\to\infty} [(4D.20)] = \prod_{l=1}^{\kappa+1} N_{\rho(l)}(\mu^{(l)},\Sigma^{(l)}),           (4D.21)

where $\mu^{(l)}=(\mu_i^{(l)})$ and $\Sigma^{(l)}=(\sigma_{ij}^{(l)})$ are the $\rho(l) \times 1$-dimensional mean and
$\rho(l) \times \rho(l)$-dimensional covariance matrices of p given data, where

    m^{(l)} = z_{Q(l,1)} + 1 + \sum_{i \notin Q(l,1)} [r_{il} z_{\{i\}} + 1]               (4D.22)

if $2 \le l \le \kappa+1$ and

    m^{(1)} = \sum_{i=1}^{k+1} [(1-\sum_{l=2}^{\kappa+1} r_{il}) z_{\{i\}} + v_i]              (4D.23)


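if l=1. Thus [Wilks (1963, p179)], for $1 \le i \le k+1$,

    \mu_i^{(1)} = [(1-\sum_{l=2}^{\kappa+1} r_{il}) z_{\{i\}} + v_i]/m^{(1)};                 (4D.24)

for $2 \le l \le \kappa+1$,

    \mu_1^{(l)} = (z_{Q(l,1)}+1)/m^{(l)};                                     (4D.25)

for $2 \le l \le \kappa+1$, $2 \le i \le \rho(l)+1$, and $j_{i-1} \notin Q(l,1)$ for $1 \le j_{i-1} \le k+1$,

    \mu_i^{(l)} = (r_{j_{i-1} l} z_{\{j_{i-1}\}}+1)/m^{(l)};                         (4D.26)

and, for $1 \le i \le \rho(l)$, $i < j \le \rho(l)$, and $1 \le l \le \kappa+1$,

    \sigma_{ii}^{(l)} = \mu_i^{(l)} (1-\mu_i^{(l)})/(m^{(l)}+1)                         (4D.27)

and

    \sigma_{ij}^{(l)} = -\mu_i^{(l)} \mu_j^{(l)}/(m^{(l)}+1).                           (4D.28)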


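In most of this appendix we find it more convenient to refer to
elements of the lth mean, covariance, and, particularly, inverse covar-
iance matrices for $2 \le l \le \kappa+1$ in terms of sets $\mathcal{B}$ and T, for $\mathcal{B}$ and T each one
of the $\rho(l)+1$ sets Q(l,1), Q(l,2)=$\{j_1\}$,...,Q(l,$\rho(l)$+1)=$\{j_{\rho(l)}\}$; $T \ne \mathcal{B}$.
Accordingly, for $2 \le l \le \kappa+1$, define

    \mu_{\{i\}}^{(l)} = (r_{il} z_{\{i\}}+1)/m^{(l)},                               (4D.29)

    \mu_{\mathcal{B}}^{(l)} = \mu_1^{(l)}          for \mathcal{B} = Q(l,1),
                    = \mu_{\{i\}}^{(l)}      for \mathcal{B} = \{i\},                    (4D.30)

    \sigma_{\mathcal{B}\mathcal{B}}^{(l)} = \mu_{\mathcal{B}}^{(l)} (1-\mu_{\mathcal{B}}^{(l)})/(m^{(l)}+1),                 (4D.31)

and, for $T \ne \mathcal{B}$,

    \sigma_{\mathcal{B}T}^{(l)} = -\mu_{\mathcal{B}}^{(l)} \mu_T^{(l)}/(m^{(l)}+1).                      (4D.32)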


Because the multivariate densities in (4D.21) are of differing dimen-
sions and on differing combinations of the same random variable p, we do
not immediately have that product (4D.21) is a k-dimensional multivariate
normal density on the k components of p. However, we now show that, owing
to the special relationship between d(1) and d(l) for l>1, product (4D.21)
is also multivariate normal in the limit.
To begin, sum the $\kappa+1$ exponents in product (4D.21). For (4D.21) to
be normal, we must be able to write this sum as the exponent of a normal
density; that is, we must be able to write

    \sum_{l=1}^{\kappa+1} (p^{(l)}-\mu^{(l)}) [\Sigma^{(l)}]^{-1} (p^{(l)}-\mu^{(l)})' = (p-u) S^{-1} (p-u)'     (4D.33)

for $0 \le u \le 1$ and $S^{-1}$ positive definite and symmetric, where $p^{(l)}$ is the k-
dimensional vector $(p_1,\ldots,p_k)$ if l=1 and the $\rho(l)$-dimensional vector of
$p_{Q(l,1)}, p_{Q(l,2)}, \ldots,$ and $p_{Q(l,\rho(l))}$ for the lth Dirichlet density in
(4D.20) if l>1.

Expanding the right-hand side of (4D.33) yields, for $S^{ij}$ the i,jth
element of $S^{-1}$ for $1 \le i,j \le k$,

    \sum_{i=1}^{k} (p_i-u_i)^2 S^{ii} + 2\sum_{i=1}^{k-1}\sum_{j>i} (p_i-u_i)(p_j-u_j) S^{ij}
        = \sum_{i=1}^{k} p_i^2 S^{ii} + 2\sum_{i=1}^{k-1}\sum_{j>i} p_i p_j S^{ij}
          - 2[\sum_{i=1}^{k} p_i u_i S^{ii} + \sum_{i=1}^{k-1}\sum_{j>i} (u_i p_j + u_j p_i) S^{ij}]
          + [\sum_{i=1}^{k} u_i^2 S^{ii} + 2\sum_{i=1}^{k-1}\sum_{j>i} u_i u_j S^{ij}].         (4D.34)

Expanding the left-hand side of (4D.33) yields, for $\sigma_{(1)}^{ij}$ the i,jth ele-
ment of $[\Sigma^{(1)}]^{-1}$ for $1 \le i,j \le k$ and $\sigma_{(l)}^{\mathcal{B}T}$ that element of $[\Sigma^{(l)}]^{-1}$
referenced by the sets $\mathcal{B}$ and T in Q(l) [recall (4D.30) - (4D.32)],

    \sum_{i=1}^{k} (p_i-\mu_i^{(1)})^2 \sigma_{(1)}^{ii}
      + 2\sum_{i=1}^{k-1}\sum_{j>i} (p_i-\mu_i^{(1)})(p_j-\mu_j^{(1)}) \sigma_{(1)}^{ij}
      + \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l), \mathcal{B} \ne \{k+1\}} (p_{\mathcal{B}}-\mu_{\mathcal{B}}^{(l)})^2 \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
      + 2\sum_{\mathcal{B} \in Q(l)} \sum_{T \in Q(l), T(1) > \mathcal{B}(1), T \ne \{k+1\}}
          (p_{\mathcal{B}}-\mu_{\mathcal{B}}^{(l)})(p_T-\mu_T^{(l)}) \sigma_{(l)}^{\mathcal{B}T}].                    (4D.35)

In (4D.35) $\mathcal{B}(1)$ means the first element of the set $\mathcal{B}$. Thus, if $\mathcal{B}=\{4\}$, then
$\mathcal{B}(1)=4$. If $\mathcal{B}$=Q(l,1), then $\mathcal{B}(1)$ is the first of the q(l) linearly ordered
elements of Q(l,1). Hence, if Q(l,1)={2,3,9}, then $\mathcal{B}(1)=2$.


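Recalling that $p_{\mathcal{B}} = \sum_{j \in \mathcal{B}} p_j$, we expand coefficients of $(p_{\mathcal{B}}-\mu_{\mathcal{B}}^{(l)})^2$ and
$(p_{\mathcal{B}}-\mu_{\mathcal{B}}^{(l)})(p_T-\mu_T^{(l)})$ for all i, j, $\mathcal{B}$, and T to write (4D.35) as

    \sum_{i=1}^{k} p_i^2 c_{ii} + \sum_{i=1}^{k-1}\sum_{j>i} p_i p_j c_{ij} + \sum_{i=1}^{k} p_i c_{0i} + c_{00},     (4D.36)

where, for

    \delta_{A,B} = 1    if A and B both true,
                 = 0    otherwise,                                     (4D.37)

the coefficients $c_{ij}$ in (4D.36) are, for $1 \le i \le k$,

    c_{ii} = \sigma_{(1)}^{ii} + \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} \delta_{\mathcal{B} \ni i, \mathcal{B} \ni i} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}];        (4D.38)

for j>i,

    c_{ij} = 2\{\sigma_{(1)}^{ij} + \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} (\delta_{\mathcal{B} \ni i, \mathcal{B} \ni j} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
             + \sum_{T \in Q(l), T \ne \mathcal{B}} \delta_{\mathcal{B} \ni i, T \ni j} \sigma_{(l)}^{\mathcal{B}T})]\};                (4D.39)

    c_{0i} = -2\{\mu_i^{(1)} \sigma_{(1)}^{ii} + \sum_{j \ne i} \mu_j^{(1)} \sigma_{(1)}^{ij}
             + \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} \delta_{\mathcal{B} \ni i, \mathcal{B} \ni i} (\mu_{\mathcal{B}}^{(l)} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
             + \sum_{T \in Q(l), T \ne \mathcal{B}} \mu_T^{(l)} \sigma_{(l)}^{\mathcal{B}T})]\};                          (4D.40)

and

    c_{00} = \sum_{i=1}^{k} (\mu_i^{(1)})^2 \sigma_{(1)}^{ii} + 2\sum_{i=1}^{k-1}\sum_{j>i} \mu_i^{(1)} \mu_j^{(1)} \sigma_{(1)}^{ij}
             + \sum_{l=2}^{\kappa+1} \{\sum_{\mathcal{B} \in Q(l), \mathcal{B} \ne \{k+1\}} [(\mu_{\mathcal{B}}^{(l)})^2 \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
             + 2\sum_{T \in Q(l), T(1) > \mathcal{B}(1), T \ne \{k+1\}} \mu_{\mathcal{B}}^{(l)} \mu_T^{(l)} \sigma_{(l)}^{\mathcal{B}T}]\}.      (4D.41)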
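Then, equating coefficients of $p_i$ in (4D.34) and (4D.36) yields
that, from the coefficients of $p_i^2$, for $1 \le i \le k$:

    S^{ii} = \sigma_{(1)}^{ii} + \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} \delta_{\mathcal{B} \ni i, \mathcal{B} \ni i} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}],        (4D.42)

and, from the coefficients of $p_i p_j$, for $1 \le i \le k$, $i < j \le k$:

    S^{ij} = \sigma_{(1)}^{ij} + \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} (\delta_{\mathcal{B} \ni i, \mathcal{B} \ni j} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
             + \sum_{T \in Q(l), T \ne \mathcal{B}} \delta_{\mathcal{B} \ni i, T \ni j} \sigma_{(l)}^{\mathcal{B}T})].                   (4D.43)

Hence, elements of the inverse covariance matrix $S^{-1}$ are finite
linear combinations of elements of the inverse covariance matrices
$[\Sigma^{(l)}]^{-1}$ and thus are of the same order of magnitude as those elements.
Therefore, from (4D.27) - (4D.28) and (4D.31) - (4D.32), elements of S,
if S exists, are $O(n^{-1})$.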

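Continuing to equate coefficients of $p_i$, we have that, from the
coefficients of $-2p_i$, for $1 \le i \le k$,

    \sum_{j=1}^{k} u_j S^{ij} = \sum_{j=1}^{k} \mu_j^{(1)} \sigma_{(1)}^{ij}
        + \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} \delta_{\mathcal{B} \ni i, \mathcal{B} \ni i} (\mu_{\mathcal{B}}^{(l)} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + \sum_{T \in Q(l), T \ne \mathcal{B}} \mu_T^{(l)} \sigma_{(l)}^{\mathcal{B}T})].                               (4D.44)

Substituting from (4D.42) and (4D.43) into the left-hand side of
(4D.44) yields the left-hand side as

    \sum_{j=1, j \ne i}^{k} u_j \{\sigma_{(1)}^{ij} + \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} (\delta_{\mathcal{B} \ni i, \mathcal{B} \ni j} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + \sum_{T \in Q(l), T \ne \mathcal{B}} \delta_{\mathcal{B} \ni i, T \ni j} \sigma_{(l)}^{\mathcal{B}T})]\}
      + u_i \{\sigma_{(1)}^{ii} + \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} \delta_{\mathcal{B} \ni i, \mathcal{B} \ni i} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}]\}
                                                                      (4D.45)
      = \sum_{j=1}^{k} u_j \sigma_{(1)}^{ij} + \sum_{l=2}^{\kappa+1} \{\sum_{\mathcal{B} \in Q(l)} [\sum_{j=1}^{k} u_j \delta_{\mathcal{B} \ni i, \mathcal{B} \ni j} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + \sum_{T \in Q(l), T \ne \mathcal{B}} \sum_{j=1}^{k} u_j \delta_{\mathcal{B} \ni i, T \ni j} \sigma_{(l)}^{\mathcal{B}T}]\}.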

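Equating coefficients of $\sigma_{(1)}^{ij}$, $1 \le i,j \le k$, $\sigma_{(l)}^{\mathcal{B}\mathcal{B}}$, and $\sigma_{(l)}^{\mathcal{B}T}$ on the
left-hand side (4D.45) of (4D.44) with those on the right-hand side of
(4D.44) yields that, from the coefficient of $\sigma_{(1)}^{ij}$,

    u_i = \mu_i^{(1)};                                                   (4D.46)

from the coefficient of $\sigma_{(l)}^{\mathcal{B}\mathcal{B}}$,

    \delta_{\mathcal{B} \ni i, \mathcal{B} \ni i} \mu_{\mathcal{B}}^{(l)} = \sum_{j=1}^{k} u_j \delta_{\mathcal{B} \ni i, \mathcal{B} \ni j};                (4D.47)

therefore,

    \mu_{\mathcal{B}}^{(l)} = \sum_{j \in \mathcal{B}} u_j,                                         (4D.48)

so that, from (4D.46) and (4D.48),

    \mu_{\mathcal{B}}^{(l)} = \sum_{j \in \mathcal{B}} \mu_j^{(1)};                                   (4D.49)

and, from the coefficient of $\sigma_{(l)}^{\mathcal{B}T}$,

    \delta_{\mathcal{B} \ni i, \mathcal{B} \ni i} \mu_T^{(l)} = \sum_{j=1}^{k} \delta_{\mathcal{B} \ni i, T \ni j} u_j = \sum_{j \in T} u_j,

so that, echoing (4D.49),

    \mu_T^{(l)} = \sum_{j \in T} \mu_j^{(1)}.                                    (4D.50)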

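The last step in equating coefficients of $p_i$ is checking that for
values (4D.46) for $u_i$, (4D.42) for $S^{ii}$, and (4D.43) for $S^{ij}$, the constant
term $\sum_{i=1}^{k} u_i^2 S^{ii} + 2\sum_{i=1}^{k-1}\sum_{j>i} u_i u_j S^{ij}$ from the right-hand side (4D.34) of
desired identity (4D.33) equals the constant term (4D.41) from the
left-hand side (4D.36) of (4D.33).

Substituting for $u_i$, $S^{ii}$, and $S^{ij}$ in the constant term of
(4D.34) yields that

    \sum_{i=1}^{k} u_i^2 S^{ii} + 2\sum_{i=1}^{k-1}\sum_{j>i} u_i u_j S^{ij}
      = \sum_{i=1}^{k} (\mu_i^{(1)})^2 [\sigma_{(1)}^{ii} + \sum_{l=2}^{\kappa+1} (\sum_{\mathcal{B} \in Q(l)} \delta_{\mathcal{B} \ni i, \mathcal{B} \ni i} \sigma_{(l)}^{\mathcal{B}\mathcal{B}})]
        + 2\sum_{i=1}^{k-1}\sum_{j>i} \mu_i^{(1)} \mu_j^{(1)} [\sigma_{(1)}^{ij} + \sum_{l=2}^{\kappa+1} \sum_{\mathcal{B} \in Q(l)} (\delta_{\mathcal{B} \ni i, \mathcal{B} \ni j} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + \sum_{T \in Q(l), T \ne \mathcal{B}} \delta_{\mathcal{B} \ni i, T \ni j} \sigma_{(l)}^{\mathcal{B}T})]
                                                                      (4D.51)
      = \sum_{i=1}^{k} (\mu_i^{(1)})^2 \sigma_{(1)}^{ii} + 2\sum_{i=1}^{k-1}\sum_{j>i} \mu_i^{(1)} \mu_j^{(1)} \sigma_{(1)}^{ij}
        + \sum_{l=2}^{\kappa+1} \sum_{\mathcal{B} \in Q(l)} [(\sum_{i \in \mathcal{B}} \mu_i^{(1)})^2 \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + 2\sum_{T \in Q(l), T(1) > \mathcal{B}(1)} (\sum_{i \in \mathcal{B}} \mu_i^{(1)})(\sum_{j \in T} \mu_j^{(1)}) \sigma_{(l)}^{\mathcal{B}T}].

Thus, since (4D.49) gives that $\sum_{j \in \mathcal{B}} \mu_j^{(1)} = \mu_{\mathcal{B}}^{(l)}$, the constant term (4D.51)
from the right-hand side of (4D.33) equals the constant term (4D.41) from
the left-hand side of (4D.33).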

Remaining in our proof is to show that $0 \le u_i \le 1$ and that $S^{-1}$ is
positive definite and symmetric. From equation (4D.46) for $u_i$,
definition (4D.24) of $\mu_i^{(1)}$, and bounds on ratios $r_{il}$, $1 \le i \le k$,
$2 \le l \le \kappa+1$, we have that $0 \le u_i \le 1$. Before proceeding to the remaining proof,
we note that $S^{-1}$ being positive definite and symmetric implies that
$S=(S^{-1})^{-1}$ exists [Graybill (1969, p318)] and is positive definite and
symmetric [Anderson (1958, p337)]; hence [Dempster (1969, p41)], S is a
covariance matrix.

The matrix $S^{-1}$ is symmetric because, from (4D.42) and (4D.43), each
element $S^{ij}$ of $S^{-1}$ is a finite sum of elements from inverse covariance
matrices $[\Sigma^{(l)}]^{-1}$, each of which is symmetric by definition of covariance.
From Dempster (1969, p41), the matrix $S^{-1}$ is positive definite if and
only if $yS^{-1}y' > 0$ for all k-dimensional $y \ne 0$. Thus, let y be any k-
dimensional vector such that $y \ne 0$. Then,


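    yS^{-1}y' = \sum_{i=1}^{k}\sum_{j=1}^{k} y_i y_j S^{ij}

      = \sum_{i=1}^{k}\sum_{j=1}^{k} y_i y_j \{\sigma_{(1)}^{ij} + \sum_{l=2}^{\kappa+1} \sum_{\mathcal{B} \in Q(l)} [\delta_{\mathcal{B} \ni i, \mathcal{B} \ni j} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + \sum_{T \in Q(l), T \ne \mathcal{B}} \delta_{\mathcal{B} \ni i, T \ni j} \sigma_{(l)}^{\mathcal{B}T}]\}
                                                                      (4D.52)
      = \sum_{i=1}^{k}\sum_{j=1}^{k} y_i y_j \sigma_{(1)}^{ij}
        + \sum_{l=2}^{\kappa+1} \sum_{i=1}^{k}\sum_{j=1}^{k} y_i y_j \sum_{\mathcal{B} \in Q(l)} [\delta_{\mathcal{B} \ni i, \mathcal{B} \ni j} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + \sum_{T \in Q(l), T \ne \mathcal{B}} \delta_{\mathcal{B} \ni i, T \ni j} \sigma_{(l)}^{\mathcal{B}T}].

Since $y \ne 0$ and the inverse covariance matrix $[\Sigma^{(1)}]^{-1}$ is positive
definite, the first term $\sum_{i=1}^{k}\sum_{j=1}^{k} y_i y_j \sigma_{(1)}^{ij}$ in the last equality in
(4D.52) is positive. It therefore remains to show that the remaining
term is nonnegative. We can write this term as

    \sum_{l=2}^{\kappa+1} \sum_{i=1}^{k}\sum_{j=1}^{k} [\sum_{\mathcal{B} \in Q(l)} (y_i y_j \delta_{\mathcal{B} \ni i, \mathcal{B} \ni j} \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + \sum_{T \in Q(l), T \ne \mathcal{B}} y_i y_j \delta_{\mathcal{B} \ni i, T \ni j} \sigma_{(l)}^{\mathcal{B}T})]

      = \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} (\sum_{i \in \mathcal{B}} y_i)^2 \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + \sum_{\mathcal{B} \in Q(l)} \sum_{T \in Q(l), T \ne \mathcal{B}} (\sum_{i \in \mathcal{B}}\sum_{j \in T} y_i y_j) \sigma_{(l)}^{\mathcal{B}T}]      (4D.53)

      = \sum_{l=2}^{\kappa+1} [\sum_{\mathcal{B} \in Q(l)} w_{\mathcal{B}}^2 \sigma_{(l)}^{\mathcal{B}\mathcal{B}}
        + \sum_{\mathcal{B} \in Q(l)} \sum_{T \in Q(l), T \ne \mathcal{B}} w_{\mathcal{B}} w_T \sigma_{(l)}^{\mathcal{B}T}] \ge 0,

for $w_{\mathcal{B}} = \sum_{i \in \mathcal{B}} y_i$ and $w_T = \sum_{j \in T} y_j$, since every matrix $[\Sigma^{(l)}]^{-1}$, $2 \le l \le \kappa+1$,
is positive definite so that the term within brackets is nonnegative
[positive unless $w_{\mathcal{B}} = 0$ for all $\mathcal{B} \in Q(l)$] and since the sum of $\kappa$ non-
negative numbers is again nonnegative. Therefore, the matrix $S^{-1}$ is
positive definite.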
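
The structure of this argument, a positive definite matrix plus nonnegative contributions aggregated over the sets in each Q(l), can be illustrated numerically. The following is only a sketch under stated assumptions: the matrices `M1` and those in `blocks` are hypothetical stand-ins for the inverse covariance matrices of d(1) and d(l), chosen to be symmetric and positive definite, and the assembly rule follows the pattern of (4D.42) and (4D.43), adding the element referenced by sets B and T to every S^{ij} with i in B and j in T. Positive definiteness is tested by attempting a Cholesky factorization; the function names are hypothetical.

```python
def is_positive_definite(M):
    """True if the symmetric matrix M admits a Cholesky factorization."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][t] * L[j][t] for t in range(j))
            if i == j:
                if s <= 0:
                    return False      # a nonpositive pivot rules out positive definiteness
                L[i][j] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True

def assemble_S_inv(M1, blocks):
    """Add, for each block, the element indexed by sets (B, T) to every
    entry (i, j) with i in B and j in T; indices are 0-based."""
    S = [row[:] for row in M1]
    for groups, M in blocks:
        for a, B in enumerate(groups):
            for b, T in enumerate(groups):
                for i in B:
                    for j in T:
                        S[i][j] += M[a][b]
    return S

# Hypothetical case with k = 3: patterns {1,2} and {2,3} (0-based {0,1}, {1,2}),
# with the singleton set for the fully classified category k+1 omitted.
M1 = [[4.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 6.0]]
blocks = [
    ([{0, 1}, {2}], [[3.0, 1.0], [1.0, 2.0]]),
    ([{1, 2}, {0}], [[2.0, 0.5], [0.5, 1.5]]),
]
S_inv = assemble_S_inv(M1, blocks)
```

The assembled matrix stays symmetric because each block matrix is symmetric, which mirrors the symmetry argument given before (4D.52).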

Thus, for values (4D.46) for $u_i$, (4D.42) for $S^{ii}$, and (4D.43) for
$S^{ij}$, equality (4D.33) holds; that is, we can write the sum of exponents
in the limiting numerator (4D.21) of the posterior density as the
exponential term of a k-dimensional multivariate normal density.

Now, the limit of the posterior density (4D.16) is the limit (4D.21) 
of the numerator divided by the limit of the denominator. To calculate 
the limit of the denominator, first note that a Dirichlet density is 
continuous in p. Further, from (4B.29), Appendix 4B, a multidimensional Dirichlet density uniformly converges to a multivariate normal density. Therefore, the product (4D.20) of Dirichlet densities, the numerator of the posterior density, is continuous in p and uniformly converges to a product of multivariate normal densities on the closed and bounded k-dimensional set $P_k$. Thus [Buck (1965, p. 186), Bartle (1966, p. 67)], we have for the denominator of the posterior density (D.5) that

$$\lim_{n\to\infty} \int_{P_k} \mathrm{num}[f(p \mid z)]\,dp = \int_{P_k} \lim_{n\to\infty} \{\mathrm{num}[f(p \mid z)]\}\,dp. \tag{4D.54}$$

Therefore, canceling coefficients n (2 tt)~^ ^[det(Z^ ] -1 ^ 2 

1=1 

in the limiting numerator and denominator of the posterior density and 
multiplying both by $(2\pi)^{-k/2}[\det(S)]^{-1/2}$ yields the limiting denominator as 1 and the limiting numerator, and thus the limiting posterior density, as the density of the k-dimensional $N_k(u, S)$ multivariate normal distribution with elements of the mean and covariance matrices given by (4D.46), (4D.42), and (4D.43), respectively.

Rao [1968, (xv), p. 104] proves that if the density of a random variable converges to some density, then the distribution of the random variable converges to the distribution for the limiting density. Therefore, we have proved for all cases but that in which all k+1 categories have some incomplete data, which case will be considered in 4D.3.3, that the limit of the k-dimensional posterior distribution of p given incomplete data is k-dimensional multivariate normal.

4D.3.2 Special Case:

In Section 4D.3.1, elements (4D.46), (4D.42), and (4D.43) of the asymptotic mean vector $u$ and inverse covariance matrix $S^{-1}$, respectively, were expressed in terms of unknown ratios $r_{il}$, $1 \le i \le k+1$, $2 \le l \le \kappa+1$, and incomplete data z. In this subsection we eliminate these ratios and derive expressions for elements of the asymptotic mean and covariance matrices in terms of the asymptotic means and the data z. Again we assume that there exists at least one category, say $C_{k+1}$, on which all data is complete. The next subsection treats the general case allowing all categories to have some incomplete data.

Recall from Section 4D.3.1 that for $\kappa$ different patterns of incomplete data, we separated the numerator of the posterior density into $\kappa+1$ Dirichlet densities d(l), $1 \le l \le \kappa+1$. For the first density we had complete data on all k+1 categories $C_i$, $1 \le i \le k+1$, and for each of the $\kappa$ remaining Dirichlet densities d(l), $2 \le l \le \kappa+1$, we had exactly one of the $\kappa$ sets of incomplete data. Recalling from (4D.19) that P(l) is the dimension of the lth Dirichlet density for $2 \le l \le \kappa+1$, note in (4D.20) that for each of the last $\kappa$ Dirichlet densities, there are P(l)-1 unknown ratios $r_{j_i l}$, $1 \le i \le P(l)-1$. Thus, there are a total of $\kappa[P(l)-1]$ unknown ratios $r_{j_i l}$, $1 \le i \le P(l)-1$, $2 \le l \le \kappa+1$.

From (4D.24) - (4D.26), (4D.29), (4D.30), (4D.46), (4D.48), and (4D.49), elements $u_i$ of the asymptotic mean vector $u$ are expressed in terms of these $\kappa[P(l)-1]$ unknown ratios. Letting $l$ range from 2 to $\kappa+1$, we could derive a system of $\kappa[P(l)-1]$ nonlinear equations in the $\kappa[P(l)-1]$ unknown ratios, from which solution the asymptotic means could be evaluated.

However, an easier approach to evaluate these means is to reexpress them in a way that eliminates the ratios altogether and leads to a system of just k nonlinear equations in k unknowns, the unknowns then being the means. In such an approach, we will have evaluated the means in a one-step, rather than two-step, process and the nonlinear system to do so will be much simpler.

Recall that $r_{il}=0$ for any Dirichlet density d(l) for which category i has incomplete data. From (4D.49), for $2 \le l \le \kappa+1$,


'1 


(1) „ 


Z Mi 

htQ(l.l) 1 


( 1 ) 


(4D.55) 


For and m^ ^ defined in (4D.22) and (4D.23), respectively, substi- 
tution from (4D.24) and (4D.25) into (4D.55) yields that 


(z 


3(k+l) 


+l)/m 


( 1 ) .. 


K+l 


E t(i - z O z *hi +v h ]/m(1) - ( 4D - 56 ) 

hCQ(l.l) g=2 h9 ihJ h 


Similarly, use of (4D.49), (4D.24), (4D.26), and p.,. 3 !- Z p« , 

k 1 ?€Q(1) * 

for all 1, yields that, for 2-i^P{l)+l, S#{k+1} 


(r- iz x . ,+l)/m (1 ) = [(1 - Z r. n )z x , ,+v. 


K+l 


(1) 


(4D.57) 


j i i‘{j i ) " lVi ' g r 2 J‘i 9 { ji> J'i 

Hence, ignoring terms 1 and v h that go to zero as the sample size 
n increases, we have from (4D.56) and (4D.57) that, for 2*i-P(l)+l, 


2-1 -K+l, and 1- j ^ -k+l , 


K+l 


" (1) /* (1) = ’W h au ) (1 ' g => )z <» 

K+l 

= r. ,/d - Z r. ), 

V g=2 J i 9 


(4D.58) 


whence, defining u 9 = Z u_. for all sets 2, we have from (4D.46) that 


, 0 ) . 


m'-' = ZnM n / Z u 


= z 


Q(1 * 1} h.Q(l,l) h 

Q(l ,1 ) /u Q( 1 ,1) ’ 


(4D.59) 


Therefore, from (4D.24), (4D.46), (4D.48), and (4D.58), for 1-i -k+l 

k+l K+l 

and n, 1 i; 1 z m + i=2 Z °<i-i> 


the sample size, 
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u . 


K+l 

= (1- Z r n ) 2 {1) /m 


( 1 ) 


1=2 


K+l 


( 1 ) 


= [z {i} /n][n(l- Z r.^l/m 

= [z {i} /n][(m (1) + Z 1 m (1) )(l- K Z 1 r. 1 )/m (1) ] 
Ui 1=2 1=2 1 1 


K+l 


K+1 "t 1 JD.Jl). 


= [z m /n][(l- Z r.,)(l+ Z nr '/m v '+ E 

m 1=2 11 1=2 1=2 

Q(l,l)3i Q(1 ,l)3i 

K+l K+l K+l 

= [z { . } /n](l- Z r ){1 Z z , ,./ z [(1- Z r )z ] 

111 11 1 = 2 Q( h€Q(l,l) . g=2 hg {h} 


1=2 


Q(U)3i 


K+l. 


K+l 


K+l 


{z + z [(1- Z r.,)z,.,/m v '] 
{1} Q(l,l)3i 1=2 11 


+ Z r /(l- z r.,)} 
1=2 11 1=2 11 
Q(U)*i 

( 1 ) 


(4D.60) 


K+l 


/ Z [(1- Z r. )z /m (1) ]z nn n )/n 
h GQ (1,1) g=2 hg {h} Q(1 ’ 1} 


$$= \Big[z_{\{i\}} + \sum_{D \ni i} \Big(u_i \Big/ \sum_{j \in D} u_j\Big) z_D\Big]\Big/n$$

$$= \Big[z_{\{i\}} + \sum_{D \ni i} z_D u_i/u_D\Big]\Big/n.$$


Note that (4D.60) is the maximum likelihood estimate (4D.1).

Successively setting i=1,...,k in equation (4D.60) yields a system of k nonlinear equations to solve for the k unknowns $u_i$, $1 \le i \le k$, where we also have the constraints $0 \le u_i \le 1$ and $\sum_{i=1}^{k+1} u_i = 1$. Some of the numerous approaches for finding a numerical solution are outlined in Scheid (1968, chpt. 25). As discussed in Section 2.3.2, Dempster, Laird, and Rubin describe an algorithm for an iterative solution. Examples allowing exact solution are given in Section 4D.5.
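
As an illustration only, the system obtained from (4D.60) can be iterated as a fixed point, each mean being replaced by the right-hand side evaluated at the current means. The sketch below is one such iteration in Python; the function name `em_fixed_point`, the 0-based category indices, the dictionary layout for the partially classified counts, the starting point, and the stopping rule are all assumptions made for this sketch and are not taken from the text. The data are those of the last example in Section 4D.5, so the result should lie close to the solution reported there in (4D.95).

```python
def em_fixed_point(z_single, z_sets, tol=1e-12, max_iter=10_000):
    """Iterate u_i <- (z_i + sum over D containing i of z_D * u_i / u_D) / n.

    z_single: fully classified counts, one per category (hypothetical layout).
    z_sets:   dict mapping a tuple of 0-based category indices D to the
              partially classified count z_D (hypothetical layout).
    """
    ncat = len(z_single)
    n = sum(z_single) + sum(z_sets.values())
    u = [1.0 / ncat] * ncat  # assumed start: center of the simplex
    for _ in range(max_iter):
        new_u = []
        for i in range(ncat):
            total = z_single[i]
            for D, z_D in z_sets.items():
                if i in D:
                    u_D = sum(u[j] for j in D)
                    total += z_D * u[i] / u_D
            new_u.append(total / n)
        change = max(abs(a - b) for a, b in zip(new_u, u))
        u = new_u
        if change < tol:
            break
    return u


# Data of the last example in Section 4D.5: z = (z1, z2, z3, z12, z13, z23)
z_single = [3000, 4400, 10000]
z_sets = {(0, 1): 5000, (0, 2): 3400, (1, 2): 4000}
u = em_fixed_point(z_single, z_sets)
print([round(x, 7) for x in u])
```

Each update keeps the means summing to one, because the allocated partial counts add back up to the sample size n.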

Noting from Graybill (1969, p. 170) the form of elements of the inverse covariance matrix $S^{-1}$ and recalling (4D.27), (4D.28), (4D.31), (4D.32), and (4D.42), we have for large n that


,11 li 

S -a 


^ Q(l,l)ii a(1 ) 

= (m^ + 


T a Q( 1 »1) .Q(l ,1). 

L Of \ + 


Qd.i »V (,) 




Z ^ ) ( u • "^"U i , » ) / (u.u.,. )+ Z n/ ^(Un/, -ix+u.,..) 

Q(l,l)*i 1 k+1 1 k+1 Q(l,l)->i Q(1 ’ 1} k+1 

(4D.61) 


^ U Q(1 ,l) u k+l^ 


$$= n(u_i+u_{k+1})/(u_i u_{k+1}) - \sum_{D \ni i} (z_D/u_D)(u_D-u_i)/(u_i u_D)$$


since

$$m^{(1)} = n - \sum_{l=2}^{\kappa+1} m^{(l)} = n - \Big(\sum_{D \ni i} z_D/u_D + \sum_{D \not\ni i} z_D/u_D\Big). \tag{4D.62}$$


Similarly, from (4D.43) and for "$Q(l,1) \ni i,j$" beneath a summation sign meaning $Q(l,1)$ containing both i and j,


S 1J =m^ 1 Vu,..i+ 


, 0 ) 


ktl+ q(i.i)^./ (u QO.i) +u w )/(u Q(M) u w )+ q(1>1) ^ >j 




k+1 


''"'os! J Z D /U D )/U k+l + lJ 5i j ( V u D )(u D +u k+l )/(u D l W <4D - 63) 


$$= n/u_{k+1} + \sum_{D \ni i,j} (z_D/u_D)/u_D.$$


Note how simple final results in (4D.62) and (4D.63) are, especially compared with corresponding equations (4D.12) and (4D.13) from the traditional approach in 4D.2. Furthermore, final results in (4D.62) and (4D.63) parallel results (given by their first term) for complete data. [See Graybill (1969, p. 171).]

4D.3.3 General Case:


As long as there is at least one category having no incompletely specified data, we can apply the methods of the preceding sections. That is, if category $C_{k+1}$ has some incomplete data, we can change the dependent variable from $p_{k+1}$ to any variable $p_i$ for which category $C_i$ has only complete data. However, there are cases in which no category has only complete data; i.e., all k+1 categories have some incomplete data, so that such a variable does not exist. In this section, we extend theory from the preceding sections to this remaining case.

The only time there are problems using the theory of the preceding 
subsections is when Q(l,l) contains that element, say k+1, that indexes 
the dependent variable for d(l). To handle these instances, we have two 
approaches. In the first approach, we write p nM as 1- E p. and 

QU,1) JtQO.l) J 

then proceed with the methods of 4D.3.1 of equating coefficients of 
powers of p^ on the left- and right-hand sides of (4D.33). A simpler 
approach is making Pg^ ^ the dependent variable and then proceeding as 
in 4D.3.1 and 4D.3.2. 

The first approach requires more types of cases than the second 
approach and, unlike the second approach, requires transformation of 
formula for the inverse covariance matrix before allowing proof that this 


matrix is positive definite. Hence, we pursue the second approach. 
Therefore, if Q( 1 , 1 ) contains k+1, then we make Pg^ ^ , instead of 

Pq(T k+q+i)’ the de P endent variable. 

Following this approach and the procedures of 4D.3.1 and 4D.3.2 yields for elements $u_i$, $S^{ii}$, and $S^{ij}$, respectively, of the mean and covariance matrices

$$u_i = \Big[z_{\{i\}} + \sum_{D \ni i} z_D (u_i/u_D)\Big]\Big/n, \tag{4D.64}$$




and 


_ _(1), v k+1 

( u i +u k+ l) / (u i u k+ i)+ e { 


Q(l,l)^k+1 

,0), 


k+1 1=2 Q(M)ji m (u i + Vl )/ ^iVl) 

QO»l)#c+l 


+ , * m ( ^( u 


Q(U)3i %(u) + Vi )/(u Q (u) u k+i )+ z m (1 ) 


Q(l,l)#k+1 


Q(l,l>k+1 

Q(U)2i 


(4D.65) 


* (u Q(U) tu i>/<V Q (U)» 

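
$$= n(u_i+u_{k+1})/(u_iu_{k+1}) - \sum_{\substack{D \not\ni i \\ D \ni k+1}} (z_D/u_D)(u_D-u_{k+1})/(u_D u_{k+1}) - \sum_{\substack{D \ni i \\ D \not\ni k+1}} (z_D/u_D)(u_D-u_i)/(u_D u_i) - \sum_{\substack{D \ni i \\ D \ni k+1}} (z_D/u_D)(u_i+u_{k+1})/(u_i u_{k+1}),$$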


S ' J=0 (1) ,J+ ,?! < £ 0m Q <>,l), { j> + E „ {i), {J) 


, 1= 2 Q(1,1) 31 - 0) 

Q(l,l)?k+1 Q(M)^j 


Q(U)*1 (1 ) 
Q(UMj 


f r v q(i,i) on n k+i 

Q(M)^i Q(l,l)3i,j f 1 ) ’ }+ a (i) {l},{j> 

Q(U)ik+l 

fn k+1 0(U1 Jj^i ,aij 

= "" /U -' + 1 [ I m<» /u + r „(1), 

QO,l)»+l Q(U) * U W q(U)3i.j (U 0(U) +u k+l ) 


Q(U)3j 


k+1 
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/(u Q(l,l)Vl> lt Q(1>1) ^ +1 ” (,)/u Q(U) 

Q(l,l)9M,/j 


(4D.66) 


= n/u, .,+ E 


E [ 


E m (1 )(l/ u 


k+1 Q(U)W QO.DsVl Q(1,1)W 


Q(l,l)3i,j 


•1/u. i )-m (1 ) ( E 1+ E 1+ E l)/u ] 
k+i Q ( 1 » 1 ) 3i 5 j Q(l,l)3i Q(l,l)# k 1 

Q(l,l)2j Q(Utej 


$$= n/u_{k+1} + \sum_{\substack{D \not\ni k+1 \\ D \ni i,j}} (z_D/u_D)/u_D - \sum_{D \ni k+1} (z_D/u_D)/u_{k+1} + \sum_{\substack{D \ni k+1 \\ D \not\ni i,\ \not\ni j}} (z_D/u_D)/u_D,$$


where "$D \ni i,j$" means D containing both i and j, "$D \not\ni i,j$" means D not containing i and j together (i.e., D can contain one or neither of i and j but not both), and all conditions under a summation sign are to be met simultaneously, since, as in Section 4D.3.2, the procedure yields that, for $2 \le l \le \kappa+1$,

$$m^{(l)} = z_{Q(l,1)}/u_{Q(l,1)} \tag{4D.67}$$

and

$$m^{(1)} = n - \sum_{l=2}^{\kappa+1} m^{(l)} \tag{4D.68}$$

for

$$u_{Q(l,1)} = \sum_{j \in Q(l,1)} u_j. \tag{4D.69}$$


Proof of positive definiteness of $S^{-1}$ will parallel that given in (4D.52) and (4D.53) of the last section with the following modification. Note from the first equality of (4D.65) and (4D.66) for $S^{ii}$ and $S^{ij}$, respectively, that no direct contribution is made to $S^{ii}$ and $S^{ij}$ from those sets $Q(l,1)$ simultaneously containing both i and k+1. Thus, we must modify (4D.53) by adding $\mathcal{S} \not\ni k+1$ under both $\mathcal{S} \in Q(l)$ and $\mathcal{T} \in Q(l)$ everywhere in (4D.53). Therefore, the sums within brackets in (4D.53) for these particular sets $Q(l)$ will involve only that submatrix of $\Sigma_{(l)}$ referring to those variables not indexed in $Q(l,1) \ni i, k+1$. But since this submatrix is also a covariance matrix, it is positive definite; thus, the remaining proof will follow like that of (4D.53).

Remaining proofs for the limiting posterior distribution are identical to those of Sections 4D.3.1 and 4D.3.2. Therefore, for all cases the limiting posterior distribution of p given incomplete multinomial data z is multivariate normal with expressions for elements of the mean and inverse covariance matrix given by (4D.64) - (4D.66). Note that, as in Section 4D.3.2, expressions (4D.65) and (4D.66) for elements of the inverse covariance matrix are simple and parallel those for complete data.


4D.4 Equivalence of Results : 


In this section, we show how results (4D.12) and (4D.13) for the 
asymptotic inverse covariance matrix given in 4D.2 by the traditional 
approach can be simplified to those, (4D.65) and (4D.66), respectively, 
given by the nontraditional approach in Section 4D.3. Because of the 
large amount of algebraic manipulation (and, thus, possible error) 
involved, knowing results (4D.65) and (4D.66) to work toward is very 
important. 

To show that (4D.12) equals (4D.65) for $S^{ii}$, divide (4D.12) into the four groupings - complete-data term, sum of all terms for which $D \not\ni i$, $D \ni k+1$, sum of all terms for which $D \ni i$, $D \not\ni k+1$, and sum of all terms for which $D \ni i$, $D \ni k+1$ - given in (4D.65). Note in making this division that there are no terms for which $D \not\ni i$, $D \not\ni k+1$, the one combination for which there is no contribution in (4D.65).

In (4D.12) we can rewrite the complete-data term as 


$$n(u_i+u_{k+1})/(u_iu_{k+1})\,\{[(u_i+u_{k+1})(1-u_i) - 2u_i(1-u_i-u_{k+1}) + u_i(1-u_i-u_{k+1})]/u_{k+1}\} = n(u_i+u_{k+1})/(u_iu_{k+1}) \tag{4D.70}$$

since the term inside braces is one.

For the sum of those terms in, (4D.12) over those sets D^i, we have 

^• U a [ ~n E *-V U D" U a) /u D + J- n Z h Vb /u D 

a^i Dsa,£i b^i,a Dsa,b D3a,^i 

D^k+1 D£i ,3*k+l D^k+l 


D£i ,3*k+l 

h Z D U b /,U D ^ u k+l 
b^i ,a D3a,b 

,3k+l 


(4D.71) 


' 'a^i Uat oL, 3f i (Z!)/UD ),/Uk+1 

D^k+l 
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D*i 

Dak+l 


[(z d / u D 2)( a J D u a )]/u k+l 
a^k+1 


= ' D*i (z d /u d ,<u d-Vi )/(u dVi> 

D?k+1 

since, inside the brackets in the first line in (4D.71), the first term 
is the negative of the second because, for the restrictions D^i,k+1 on 
D for these two terms, 


u D“ u a = * V 
beD 

b^i ,a 

From the last two terms inside these brackets, we pick up 


(4D.72) 


E z D u k+l^ u D 
D^a,k+1 u K 1 u 


(4D.73) 


since, for the restrictions D^i,D*k+l on D for these two terms. 


u n -u , = E u.+u. ... 

Da b€D b k+1 

b^i ,a 


(4D.74) 


For the sum of those terms in (4D.12) over those sets D3i , we have 


‘•("i tU k+l) Z/ “i 0 “DlV"i) /l, D !+2 < u i t Vl l J 1 D5i,a VD/UD2 


k k 

+ Eu a [ - E z D ( V u a )/u D < E k z D ll b /u D 2)1,/u ' 
a^i Dsa b^i ,a D?a,b 


k+1 


Dai 


D3i 


=t o>i V u d 2 [ - (u dV(VVi ) 2/ V 2 <VW <u d- u i-Vi> 

0>k+l 

-<V u k+l><VWl ))+ D E f z D /u D 2[ -< u D- u i )(u i + Vi )2/u i 

D^k+1 


(4D.75) 
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+ 2 <V u j H VVi )- u i (u D- u i ) 1 }/ Vi 2 
= - D 2 . (V u D><Wi )/(u iVi> - ^ <V u d)(V u i)/( u i u o> 

D>k+1 D^k+1 

since, inside the braces in the first line of (4D.75) we can divide the 
first term into a sum over sets Dak+1 and a sum over sets D^k+1, we can 
write the second term as 


2( Ui +u. .)[ Z z d /u d 2 ( Z u )+ z z D /u n 2 (Zu)] 

1 k+1 D*i U D a*D a Dal D a*D a 
D?k+1 a? 4 i ,k+l D^k+1 aj^i 

=2 <Wl )[ Z D < U D - U i-U k+1 )/U D 2+ z D (u D - Ul .)/u 0 2 ], 
Dak+l D^k+1 

and we can write the last two terms as 

2 u a^~ 2 z D^ u D' u a^ u D 2+ , ,? n 1 . . z D u b^ u D 
a^i D3i ,a b^i ,a Daa,b,i 


(4D.76) 


D^k+1 


tyk+1 


" ? z D^ u D" u a^ u D 2+ , .? _ 1 . . z D u b^ u D ^ u k+l 

Dai ,a b^i ,a Daa,b,i 

Dak+1 Da k+1 (4D.77) 

■■ii^ [ Dj.1 2D ^ /U ‘> 2+ D,a,1 ZD<Ul+Uk+ > )/UD2,/Uk+lZ 

D^k+1 D^k+l 

=-[u, ^ z D (V u i> /u D 2 t(u i +U k+l) D J f z D< u D- u i- u kH )/u D 2 ]'' u k + l 2 
D*k+1 D*k+1 


since the first two terms inside brackets in the first line of (4D.77) 
combine through 
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u n -u = Z u. 
D a b€D b 
b^a 

and the last two terms combine through 


(4D.78) 


u n -u, = Z Ui+U.,.,. 
D a bcD b k+1 
b^a 


(4D.79) 


Therefore, from (4D.70), (4D.71), and (4D.75), we have for $S^{ii}$ that (4D.12) from the traditional approach does simplify to (4D.65) from the nontraditional approach.

Similarly breaking up terms given in (4D.13) for S 1J from the tra- 
ditional approach, we have that, since 
k k k 

Z u (-n Z u.) = -n Z u (1-u -u.-u.-u. .) 
aj»1,j a b?«i ,j D a^i,j a a 1 J k l 
b^a 

k p 2 

= n[ Z u -(l-u.-u.-u. .) ], 

a^i ,j a i J k+l 

the complete-data term in (4D.13) is 


(4D.80) 


n{( ui + Vi )I(1 ' u i ) ' ( VVi ) ‘( 1 ‘ u r u j‘Vi ]+u j [ ' u i' (1 ' u r u j' u k+i ) 

t(u j +u k+i )(1 ' u j )/u j 1+(1 ‘ u r u o' u k+i )t ' u r ( V u kti ) ' (1 ' u i" u j' u k+i )+111/u k+i 2 

(4D.81) 


= n/u 


k+1* 


Noting that we will find no contribution for the case D^k+l,i,j, 
we divide each of the ten sums over sets D in (4D.13) into the five 
cases: D^k+1, D^i.j; D?k+1, D>i,j; D3k+l,i, D^j; D?k+l,j, D^i ; 

D*k+1, D^i.j. 
Doing so and then combining results for the ninth and tenth sums, we have for the five cases that

D> ^ 1 <V u D ) (Vl 2u D>' 1[ - (u 1 +u j )(u D- u 1- u j) +( V U M )( V u i- u j ) 

DM ,j 

+u i< u D- u ,— , j)-( u j +U k+l)( u D- U j) +,, j< U D- U 1- U j) +u 1 U j 

+ ( u i + Vi)( u D- u r u j) + ( u i +u kH )(u j +u M)-( u i t Vi)( u D- u i ) 1 

= Z (z n /u D )/u n , (4D.82) 

D?k+1 u u u 

D^i ,j 


D^ + iZ D /( u D u k+ i) [-(u i ^ k+1 )(u D -u 1 -u J -u k+1 )+(u j +u k+1 )(u D .u 1 -u j -u k+1 ) 
D*i , j 


+Ul(u 0 -u r u r u k +1 )-(Y Uk+ l )(u O- l, j )tu o (u O- u r u j- u kH> +u 1 u j 

+ ( u i +u kn) (u D- u r u j- u k + i) + < u i tu kti)( u j +u k + i>- (u i +u k+i )<u D- u i)l 

= - Z ( Z q/u d ) /li . , , s (4D.83) 

D3k+1 K 1 

Dai ,j 


Z 

Dak+l.i 


V (u o u k+i>‘ 


[-(VViXVWi 

+ <Wl )( VVVl 


)* u i (u D -u j - u k+ 1 ) 

) - t u D - u i ) C u j +u k+l 


)] 


= - z 


D?k+1 

Ofj 


( z D /u D )/Uk + i, 

> i 


(4D.84) 
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•l u j +u y.+l 


(40.85) 


nd , for ^ ^ 5t CaSe> ^ z 0 (o 0 -o^ ,U 0 2yUk+1 (40.86^ 

^ & #$ 


.. ( W^ V ^i*-**- 

0^‘V ... , e have for , nW »1 
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th e tradftrooal 
4D.5 Examples Allowing Exact Solution:

When the nonlinear system (4D.64) of equations for the asymptotic mean involves a polynomial in the mean components of degree less than 5, then an exact algebraic solution exists for the asymptotic mean and, hence, for the asymptotic posterior covariance matrix. In this section, we give three examples. For the first two examples, we give exact algebraic solutions as well as numerical evaluation for a data set. The second example requires use of the MACSYMA symbolic computer system. We conclude the section with a numerical example for the most general case for the trinomial distribution. This general case requires solution of a fifth-degree polynomial. For one data set, we use MACSYMA to evaluate the five roots. The usual probability constraints $0 \le p_i \le 1$ and $\sum_{i=1}^{3} p_i = 1$, along with the nature of the data, preclude all solutions but one.

Note that the analysis in this section holds for the posterior mode 
and the Taylor-series approximate posterior mean as well as for the asymptotic 
posterior mean, which is the maximum likelihood estimate. In general, we 
do not use the exact solutions because they are too expensive and, as just 
discussed, hold only for special cases. Instead, we use the EM iterative 
algorithm of Dempster, Laird, and Rubin (1977) discussed in Section 2.3.2 
to evaluate elements of the maximum likelihood estimate (hence, the asymptotic 
posterior mean), posterior mode, and Taylor-series approximate posterior 
mean. 

For this section we drop the braces in the set notations {i}. Hence, we write $z_i$ rather than $z_{\{i\}}$.

For the first example, we calculate the asymptotic mean and covariance matrix of p given incomplete trinomial data $z=(z_1,z_2,z_3,z_{12})$. Expression (4D.64) gives two equations

$$u_1 = (z_1+z_{12}u_1/u_{12})/n, \qquad u_2 = (z_2+z_{12}u_2/u_{12})/n \tag{4D.87}$$

to solve for the two unknowns $u_1$ and $u_2$. Note that $u_1+u_2=1-z_3/n$ and that $u_3=1-u_1-u_2=z_3/n$. Solving (4D.87) for $u_1$ and $u_2$ yields that

$$u_1 = z_1[1+z_{12}/(z_1+z_2)]/n \quad\text{and}\quad u_2 = z_2[1+z_{12}/(z_1+z_2)]/n. \tag{4D.88}$$


From (4D.65) and (4D.66), elements of the asymptotic inverse posterior covariance matrix are

$$S^{11} = n(u_1+u_3)/(u_1u_3) - z_{12}u_2/[u_1(u_1+u_2)^2],$$

$$S^{12} = n/u_3 + z_{12}/(u_1+u_2)^2, \tag{4D.89}$$

and

$$S^{22} = n(u_2+u_3)/(u_2u_3) - z_{12}u_1/[u_2(u_1+u_2)^2].$$

For data having values $z_1=105$, $z_2=98$, $z_3=200$, and $z_{12}=200$, evaluation of (4D.88) and (4D.89) yields

$$u_1 = .35, \quad u_2 = .32, \quad u_3 = .33,$$

and

$$S^{11} = 3{,}142.9, \quad S^{12} = 2{,}272.8, \quad \text{and} \quad S^{22} = 3{,}224.3. \tag{4D.90}$$

Hence, elements $S_{ij}$ of the asymptotic posterior covariance matrix S have values

$$S_{11} = 6.4900 \times 10^{-4}, \quad S_{12} = -4.5748 \times 10^{-4}, \quad \text{and} \quad S_{22} = 6.3261 \times 10^{-4}. \tag{4D.91}$$

From (4D.91) the standard deviations $\sqrt{S_{11}}=.025$ and $\sqrt{S_{22}}=.025$ are 7.1% and 7.6% of $u_1$ and $u_2$, respectively.
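
The arithmetic of this first example can be retraced with a few lines of Python. This is only a sketch: the variable names are hypothetical, and the means are carried at full precision, whereas the values printed in (4D.90) appear to have been computed from means rounded to two decimals, so agreement should be expected only to within about one percent.

```python
# Incomplete trinomial data of the first example: z = (z1, z2, z3, z12)
z1, z2, z3, z12 = 105, 98, 200, 200
n = z1 + z2 + z3 + z12

# Closed-form asymptotic means, as in (4D.88); u3 = z3 / n
scale = 1 + z12 / (z1 + z2)
u1, u2, u3 = z1 * scale / n, z2 * scale / n, z3 / n

# Elements of the asymptotic inverse covariance matrix, as in (4D.89)
s11_inv = n * (u1 + u3) / (u1 * u3) - z12 * u2 / (u1 * (u1 + u2) ** 2)
s12_inv = n / u3 + z12 / (u1 + u2) ** 2
s22_inv = n * (u2 + u3) / (u2 * u3) - z12 * u1 / (u2 * (u1 + u2) ** 2)

# Invert the symmetric 2x2 matrix to get the covariance elements
det = s11_inv * s22_inv - s12_inv ** 2
s11, s12, s22 = s22_inv / det, -s12_inv / det, s11_inv / det

print(round(u1, 4), round(u2, 4), round(u3, 4))
print(s11_inv, s12_inv, s22_inv)
print(s11, s12, s22)
```

Because the off-diagonal element of the inverse covariance matrix is positive, the covariance between the two means comes out negative.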

For the next example, consider incomplete trinomial data $z=(z_1,z_2,z_3,z_{12},z_{13})$. From (4D.64) the asymptotic mean is the solution of the following nonlinear system of equations

$$u_1 = (z_1+z_{12}u_1/u_{12}+z_{13}u_1/u_{13})/n \quad\text{and}\quad u_2 = (z_2+z_{12}u_2/u_{12})/n \tag{4D.92}$$

where $u_3=1-u_1-u_2$ $[=(z_3+z_{13}u_3/u_{13})/n]$. Substituting for $u_3$ in (4D.92) and solving with MACSYMA yields in the following Table 4D.1 the three algebraic solutions for $u_1$ and $u_2$.

Substitution of data $z_1=100$, $z_2=200$, $z_3=200$, $z_{12}=200$, $z_{13}=200$, and n=900 into the three solution sets yields the three solutions $u_1=u_2=0$; $u_1=u_2=1/3$; and $u_1=-1/3$, $u_2=2/3$. Consideration of the constraint $u_i \ge 0$ eliminates the third solution. Consideration of the data eliminates the first solution. Therefore, there is only one satisfactory solution:

$$u_1=u_2=u_3=1/3.$$
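
As a check on the reported roots, the two equations of (4D.92) can be evaluated in exact rational arithmetic. The sketch below uses Python's standard `fractions` module; the helper name `residuals` is hypothetical. The root $u_1=u_2=0$ is not evaluated, because the ratios $u_1/u_{12}$ and $u_2/u_{12}$ are undefined there.

```python
from fractions import Fraction as F

# Data of the second example: z = (z1, z2, z3, z12, z13), n = 900
z1, z2, z3, z12, z13 = 100, 200, 200, 200, 200
n = z1 + z2 + z3 + z12 + z13


def residuals(u1, u2):
    """Left side minus right side of the two equations in (4D.92)."""
    u3 = 1 - u1 - u2
    r1 = u1 - (z1 + z12 * u1 / (u1 + u2) + z13 * u1 / (u1 + u3)) / n
    r2 = u2 - (z2 + z12 * u2 / (u1 + u2)) / n
    return r1, r2


print(residuals(F(1, 3), F(1, 3)))   # admissible root
print(residuals(F(-1, 3), F(2, 3)))  # root excluded by the constraint u_i >= 0
```

Both calls return zero residuals, so both points solve the system algebraically; only the first satisfies the probability constraints.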

Note that results given in Table 4D.1 were expensive to obtain and utilized the maximum amount of computer memory available. Yet, these results were for only two patterns of incomplete data. Further, each of these patterns ({1,2} and {1,3}) involved only two categories ($C_1,C_2$ and $C_1,C_3$, respectively). A total of only three variables ($p_1$, $p_2$, and $p_3$) was involved. Hence, an algebraic solution can be obtained only in very special cases.

For the last example, consider incomplete trinomial data $z=(z_1,z_2,z_3,z_{12},z_{13},z_{23})$. For elements of the asymptotic mean and inverse covariance matrices, equations (4D.64) - (4D.66) yield

$$u_1 = (z_1+z_{12}u_1/u_{12}+z_{13}u_1/u_{13})/n, \qquad u_2 = (z_2+z_{12}u_2/u_{12}+z_{23}u_2/u_{23})/n, \tag{4D.93}$$

and

$$S^{11} = nu_{13}/(u_1u_3) - z_{23}u_2/(u_3u_{23}^2) - z_{12}u_2/(u_1u_{12}^2) - z_{13}/(u_1u_3),$$

$$S^{12} = n/u_3 + z_{12}/u_{12}^2 - z_{13}/(u_3u_{13}) - z_{23}/(u_3u_{23}), \tag{4D.94}$$

$$S^{22} = nu_{23}/(u_2u_3) - z_{13}u_1/(u_3u_{13}^2) - z_{12}u_1/(u_2u_{12}^2) - z_{23}/(u_2u_3),$$

respectively. Note that (4D.93) is a nonlinear system of equations involving fifth powers of the means. Therefore, we do not obtain the exact algebraic solution. However, suppose that $z_1=3{,}000$, $z_2=4{,}400$, $z_3=10{,}000$, $z_{12}=5{,}000$, $z_{13}=3{,}400$, and $z_{23}=4{,}000$. Then, substituting these values into (4D.93) and setting $u_3=1-u_1-u_2$ yields, with the aid of MACSYMA, the five sets of solutions:

$u_1=u_2=0$; $u_1=0.8151925$, $u_2=0.852957431$; $u_1=-0.52547874$, $u_2=0.75930824$; $u_1=0.20089479$, $u_2=0.29789739$; and $u_1=0.73063732$, $u_2=-0.51744858$.

Consideration of the constraints $u_i \ge 0$ and $\sum_{i=1}^{3} u_i = 1$ eliminates all solutions except the first and fourth. Consideration of the data eliminates the first solution. Therefore, the only satisfactory solution to (4D.93) is

$$u_1=.2008948, \quad u_2=.2978974, \quad u_3=.5012078. \tag{4D.95}$$

Substituting solution (4D.95) into (4D.94) yields for the asymptotic inverse covariance matrix $S^{-1}$ the elements

$$S^{11} = 1.4050 \times 10^{5}, \quad S^{12} = 5.9904 \times 10^{4}, \quad \text{and} \quad S^{22} = 1.1638 \times 10^{5}, \tag{4D.96}$$

whence elements of the asymptotic covariance matrix S are

$$S_{11} = 9.1184 \times 10^{-6}, \quad S_{12} = -4.6934 \times 10^{-6}, \quad \text{and} \quad S_{22} = 1.1008 \times 10^{-5}. \tag{4D.97}$$
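
The values in (4D.96) and (4D.97) can be retraced from the solution (4D.95). The sketch below is a hedged reading of (4D.94) for the trinomial case, with the squared sums $u_{12}^2$, $u_{13}^2$, and $u_{23}^2$ in the denominators assumed from the general expressions (4D.65) and (4D.66); variable names are hypothetical.

```python
# Data of the last example: z = (z1, z2, z3, z12, z13, z23)
z1, z2, z3 = 3000, 4400, 10000
z12, z13, z23 = 5000, 3400, 4000
n = z1 + z2 + z3 + z12 + z13 + z23

# Asymptotic means taken as given from (4D.95)
u1, u2, u3 = 0.2008948, 0.2978974, 0.5012078
u12, u13, u23 = u1 + u2, u1 + u3, u2 + u3

# Elements of the asymptotic inverse covariance matrix (assumed reading of 4D.94)
s11_inv = (n * u13 / (u1 * u3) - z23 * u2 / (u3 * u23 ** 2)
           - z12 * u2 / (u1 * u12 ** 2) - z13 / (u1 * u3))
s12_inv = n / u3 + z12 / u12 ** 2 - z13 / (u3 * u13) - z23 / (u3 * u23)
s22_inv = (n * u23 / (u2 * u3) - z13 * u1 / (u3 * u13 ** 2)
           - z12 * u1 / (u2 * u12 ** 2) - z23 / (u2 * u3))

# Invert the symmetric 2x2 matrix to obtain the covariance elements
det = s11_inv * s22_inv - s12_inv ** 2
s11, s12, s22 = s22_inv / det, -s12_inv / det, s11_inv / det

print(s11_inv, s12_inv, s22_inv)
print(s11, s12, s22)
```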
APPENDIX 4E
ERROR PROPAGATION 


In this appendix, we study the error incurred when the iterative solution to an approximation

$$p = G(p) \tag{4E.1}$$

is considered as a solution to the function

$$p = g(p) \tag{4E.2}$$

being approximated. In particular, we consider the approximation (4E.1) where

$$G_i(p) = \Big(z_i + \nu_i + \sum_{D \ni i} z_D p_i/p_D\Big)\Big/m, \quad 1 \le i \le k, \tag{4E.3}$$


for the function (4E.2) where 


g (p) = (z +v )/m+ z z /m{p./p +E[3' (r ln )/(3p3p\)U+h.o.t.} 

1 ~ * 1 0 3 -j U i V ~ iw ~ ~ P 


(4E.4) 


= G. (p) + e.. 


In (4E.4), "h.o.t." denotes higher order terms in the Taylor-series expansion of $p$ about the exact posterior mean $\bar{p}$ [see Appendix 3B], where, however, evaluation of the partial derivatives is now made at $p$, not $\bar{p}$. The term $(r_{iD})$ denotes the matrix of ratios $r_{iD} = p_i/p_D$.

Note that no element of the matrices of partial derivatives is a function of the sample size n. For example, elements of $[\partial^2 (r_{iQ})/(\partial p\,\partial p')]_{p}$ are given by, where i and j are elements of the set Q,

$$\partial^2 r_{iQ}/\partial p_i^2 = -2(p_Q - p_i)/p_Q^3,$$

$$\partial^2 r_{iQ}/(\partial p_i\,\partial p_j) = (2p_i - p_Q)/p_Q^3,$$

and

$$\partial^2 r_{iQ}/\partial p_j^2 = 2p_i/p_Q^3.$$

For $j \notin Q$, $\partial^2 r_{iQ}/(\partial p_q\,\partial p_j) = 0$ for any q.
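
The second partial derivatives of a ratio $r_{iQ} = p_i/p_Q$, with $p_Q$ the sum of the components indexed in Q, can be checked numerically by central finite differences. The sketch below treats all components of p as free variables (an assumption made only for this check; it ignores the constraint that the probabilities sum to one), and the function names, the test point, and the step size are hypothetical.

```python
def ratio(p, i, Q):
    """r_iQ = p_i / p_Q, with p_Q the sum of the components indexed in Q."""
    return p[i] / sum(p[j] for j in Q)


def second_partial(f, p, a, b, h=1e-4):
    """Central finite-difference estimate of d^2 f / (dp_a dp_b) at p."""
    def shifted(da, db):
        q = list(p)
        q[a] += da
        q[b] += db
        return f(q)
    return (shifted(h, h) - shifted(h, -h)
            - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)


p = [0.2, 0.3, 0.5]   # arbitrary test point, components treated as free
Q, i, j, out = (0, 1), 0, 1, 2
p_Q = p[0] + p[1]


def f(q):
    return ratio(q, i, Q)


d_ii = second_partial(f, p, i, i)     # compare with -2 (p_Q - p_i) / p_Q^3
d_ij = second_partial(f, p, i, j)     # compare with (2 p_i - p_Q) / p_Q^3
d_jj = second_partial(f, p, j, j)     # compare with 2 p_i / p_Q^3
d_out = second_partial(f, p, i, out)  # index outside Q: compare with 0
print(d_ii, d_ij, d_jj, d_out)
```

None of these derivatives involves the sample size n, which is the point used in the order-of-magnitude argument that follows.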

From (4.13), elements of the posterior covariance matrix $\Sigma$ are of magnitude $O(n^{-1})$ and from Lemma 3B.2, elements of the higher order terms are of successively decreasing order of magnitude. Therefore, the error $e_i$ in (4E.4) is of order $O(n^{-1})$ for all i; i.e.,

$$e_i = O(n^{-1}), \quad 1 \le i \le k. \tag{4E.5}$$

We use the following lemma and proof derived from Theorem 3, page 92, and Theorem 2, page 111, of Isaacson and Keller (1966):

Lemma 4E.1: Suppose, for $1 \le i \le k$, that we have approximated $p_i = g_i(p)$ by a function $G_i(p)$ in such a way that the error $e_i(p)$ in $G_i(p)$ is bounded by some value $\delta > 0$. Suppose, further, that we use the iterative scheme given by

$$p^{(s+1)} = G(p^{(s)}) \tag{4E.6}$$

to calculate a root of G(p). Note that (4E.6) can also be written as

$$p^{(s+1)} = g(p^{(s)}) + e^{(s)} \tag{4E.7}$$

where $|e_i^{(s)}| \le \delta$ for all i.

From Appendix 3B, one root of $g_i(p)$ in (4E.4) is the exact posterior mean $\bar{p}$, which we now study. Suppose that in all intervals $\|p-\bar{p}\|_\infty \le \rho$, where $\|p-\bar{p}\|_\infty = \max_{1 \le i \le k} |p_i - \bar{p}_i|$ and $\rho > 0$, g(p) satisfies

$$\max_i \sum_{j=1}^{k} |\partial g_i(p)/\partial p_j| \le \lambda < 1. \tag{4E.8}$$

Let the initial iterative estimate $p^{(0)}$ be any point in the $\rho_0$ sphere $\|p-\bar{p}\|_\infty \le \rho_0$ for $0 < \rho_0 \le \rho - \delta/(1-\lambda)$. Then the iterates $p^{(s)}$ of (4E.7) lie in the interval $\|p-\bar{p}\|_\infty \le \rho$ and

$$\|\bar{p}-p^{(s)}\|_\infty \le \delta/(1-\lambda) + \lambda^s[\rho_0 - \delta/(1-\lambda)] \tag{4E.9}$$

where $\lambda^s \to 0$ as $s \to \infty$.


Proof (by induction):


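By assumption, $\|\bar{p}-p^{(0)}\|_\infty \le \rho_0$. Therefore, $\|\bar{p}-p^{(0)}\|_\infty \le \rho_0 + \delta/(1-\lambda) \le \rho$. Assume that $p^{(l)}$ for $1 \le l \le s-1$ are in $\|p-\bar{p}\|_\infty \le \rho$. Then,

$$\|\bar{p}-p^{(s)}\|_\infty = \|g(\bar{p}) - [g(p^{(s-1)}) + e^{(s-1)}]\|_\infty \le \|g(\bar{p}) - g(p^{(s-1)})\|_\infty + \delta. \tag{4E.10}$$

Now, for any two points $\bar{p}$ and $p^{(s-1)}$ in $\|p-\bar{p}\|_\infty \le \rho$, Taylor's theorem yields that

$$g_i(\bar{p}) - g_i(p^{(s-1)}) = \sum_{j=1}^{k} \partial g_i(\xi^{(i)})/\partial p_j\,(\bar{p}_j - p_j^{(s-1)}), \quad \text{for } 1 \le i \le k, \tag{4E.11}$$

where $\xi^{(i)}$ is a point on the open line segment joining $\bar{p}$ and $p^{(s-1)}$. Thus, $\xi^{(i)}$ is in $\|p-\bar{p}\|_\infty \le \rho$ and

$$|g_i(\bar{p}) - g_i(p^{(s-1)})| \le \sum_{j=1}^{k} |\partial g_i(\xi^{(i)})/\partial p_j| \times |\bar{p}_j - p_j^{(s-1)}| \le \|\bar{p}-p^{(s-1)}\|_\infty \sum_{j=1}^{k} |\partial g_i(\xi^{(i)})/\partial p_j| \le \lambda \|\bar{p}-p^{(s-1)}\|_\infty. \tag{4E.12}$$

Since the inequality holds for each i,

$$\|g(\bar{p}) - g(p^{(s-1)})\|_\infty \le \lambda \|\bar{p}-p^{(s-1)}\|_\infty. \tag{4E.13}$$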

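Therefore, from (4E.10) and (4E.13),

$$\begin{aligned}
\|\bar{p}-p^{(s)}\|_\infty &\le \lambda \|p^{(s-1)}-\bar{p}\|_\infty + \delta \\
&\le \lambda^2 \|p^{(s-2)}-\bar{p}\|_\infty + \lambda\delta + \delta \\
&\le \lambda^3 \|p^{(s-3)}-\bar{p}\|_\infty + \lambda^2\delta + \lambda\delta + \delta \\
&\le \lambda^s \|p^{(0)}-\bar{p}\|_\infty + \lambda^{s-1}\delta + \cdots + \lambda\delta + \delta \\
&\le \lambda^s \rho_0 + \delta[(1-\lambda^s)/(1-\lambda)] \\
&= \lambda^s \rho_0 + \delta/(1-\lambda) - \lambda^s\delta/(1-\lambda) \\
&\le \rho_0 + \delta/(1-\lambda) \\
&\le \rho.
\end{aligned} \tag{4E.14}$$

Therefore, all the iterates $p^{(s)}$ lie in $\|p-\bar{p}\|_\infty \le \rho$ and the iteration process is defined. Finally, from the last inequality involving s,

$$\|\bar{p}-p^{(s)}\|_\infty \le \delta/(1-\lambda) + \lambda^s[\rho_0 - \delta/(1-\lambda)]; \tag{4E.15}$$

$$|\bar{p}_i - p_i^{(s)}| \le \delta/(1-\lambda) + \lambda^s[\rho_0 - \delta/(1-\lambda)] \tag{4E.16}$$

for all $1 \le i \le k$.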
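
A small numerical illustration of the bound in (4E.15) may help. The sketch below uses a made-up scalar contraction, not any function from this appendix: g(p) = 0.5p + 0.25 has fixed point 0.5 and derivative 0.5 everywhere, and the perturbed map G adds a constant error of size 0.01. The names `error_bound`, `g`, and `G` and all numerical values are assumptions chosen only to exercise the inequality.

```python
def error_bound(delta, lam, rho0, s):
    """Right-hand side of (4E.15): delta/(1-lam) + lam**s * (rho0 - delta/(1-lam))."""
    limit = delta / (1 - lam)
    return limit + lam ** s * (rho0 - limit)


def g(p):
    """Hypothetical exact map: fixed point 0.5, |g'(p)| = 0.5 everywhere."""
    return 0.5 * p + 0.25


delta = 0.01  # bound on the approximation error
lam = 0.5     # contraction constant of g


def G(p):
    """Hypothetical approximating map: differs from g by at most delta."""
    return g(p) + delta


p_fixed = 0.5
p = 0.7                    # initial iterate
rho0 = abs(p - p_fixed)    # radius of the initial neighborhood

errors, bounds = [], []
for s in range(30):
    errors.append(abs(p - p_fixed))
    bounds.append(error_bound(delta, lam, rho0, s))
    p = G(p)               # iterate the approximating map, as in (4E.6)

print(errors[-1], bounds[-1], delta / (1 - lam))
```

In this toy case the error settles at delta/(1-lam) rather than at zero: iterating the approximation reaches the exact fixed point only up to the scaled approximation error.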

This lemma shows that the exact posterior mean $\bar{p}$ satisfying (4E.4) can be approximated by the Taylor-series approximate posterior mean $\hat{p}$ from (4E.3) to an accuracy determined essentially by the accuracy of the errors $\delta \ge e_i(p) = g_i(p) - G_i(p)$ in (4E.4). Thus, $e_i(p)$ small for all i implies that $\bar{p}_i - \hat{p}_i$ is small.

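From (4E.5), $e_i(p) = O(n^{-1})$ for all i. Thus, $\delta = O(n^{-1})$. From (4E.4),

$$\begin{aligned}
\max_i \sum_{j=1}^{k} |\partial g_i(p)/\partial p_j| &= \max_i \Big\{ \sum_{D \ni i} z_D/m \Big[ \sum_{j=1}^{k} |\partial (p_i/p_D)/\partial p_j| + O(n^{-1}) \Big] \Big\} \\
&= \max_i \sum_{D \ni i} z_D/m \Big[ |(p_D-p_i)/p_D^2| + \sum_{j \in D,\, j \ne i} |-p_i/p_D^2| \Big] + O(n^{-1}) \\
&= \max_i \sum_{D \ni i} z_D/m \Big[ (p_D-p_i)/p_D^2 + \sum_{j \in D,\, j \ne i} p_i/p_D^2 \Big] + O(n^{-1}) \\
&= \max_i \sum_{D \ni i} z_D/m \{ 1/p_D + [\beta(D)-2]p_i/p_D^2 \} + O(n^{-1})
\end{aligned} \tag{4E.17}$$

for $\beta(D)$ the number of elements in D.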

In general, there is no guarantee that there exists a neighborhood of $\bar{p}$ in which (4E.8) is satisfied everywhere within the neighborhood. If there is, we call the largest such neighborhood the epm (exact-posterior-mean) convergence region. [See the following Figure 4E.1 for an illustration of an epm convergence region.]

Note, however, that for the trinomial distribution $\beta(D)=2$. Hence, the second term $[\beta(D)-2]p_i/p_D^2$ in (4E.17) is zero and

$$\max_{1 \le i \le k} \sum_{j=1}^{k} |\partial g_i(p)/\partial p_j| = \max_{1 \le i \le k} \sum_{D \ni i} (z_D/m)/p_D + O(n^{-1}).$$

Further, recall from Sections 1.2, 2.2.3, and 4D.3 that z can be consid- 
ered as coming from related multinomial populations. For example, z= 
(z^^jZgjZ^) can be considered as coming from a trinomial distribution 
with v i =z p v 2 =z 2’ v 3 anc ^ a binomial distribution with y^z^ anc * ^3’ 
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where v 3 +y 3 =z 3> For n i2 =z l2 +y 3* then > z 12^ n = ^ z 12^ n 12^ n 12^ = ( n \ 

- 1/2 

><Pl 2 = (n^/ n )P^2 + ^p^ n )» w P ere P^2 t * ie rnaximum likelihood estimate. 

Therefore, for any incompletely specified data z^ 

z D /m = (n D /n)p D + 0 p (n" 1/2 ), (4E.18) 

so that, for the trinomial distribution, we can write (4E.17) as 

max 2 |3g (p)/3p | = max z (n./n)p n /p_ + 0(n" 1/2 ) . (4E.19) 

l^k j=l 1 ~ J i D=>i u D D p 

Because 2 n <n, 2 n n /n<l. Therefore, for large enough sample size n, 

03i u Dsi u 

the bound X = max Z |3g.(p)/3p.| is less than 1 if p- is close enough to 
i j=l 1 ~ 1 D 

Pp. Since for all values of p {which never has zero components because 

the prior parameter v never has zero components) there does exist a neigh- 
borhood such that p D /p D will be close to 1 for all values of P D in this 
neighborhood, for the trinomial distribution there exists an epm conver- 
gence region. For higher dimensions, however, there need not exist an 
epm convergence region and we give an example of such a case in the main 
text. Section 4.3.2. 

Observe that, anytime (4E.19) is satisfied, the term $\delta/(1-\lambda)$ in (4E.15) is $O(n^{-1})$. Since $\rho_0$ is a constant, $\delta/(1-\lambda)=O(n^{-1})$, $\lambda < 1$, and, in particular, s can be assumed as large as desired, the term $\lambda^s[\rho_0-\delta/(1-\lambda)]$ in (4E.15) can be assumed to be zero. In particular, s can be assumed large enough that $\lambda^s$ is small enough that this term is of magnitude no greater than $O(n^{-1})$.

Therefore, if there exists a neighborhood $\|p-\bar{p}\|_\infty \le \rho$ around $\bar{p}$ such that (4E.8) is satisfied and, further, the initial iterative estimate $p^{(0)}$ is chosen in the neighborhood $\|p-\bar{p}\|_\infty \le \rho_0 \le \rho - \delta/(1-\lambda)$, then the error in the Taylor-series approximate posterior mean $\hat{p}$ is $O(n^{-1})$, i.e.,

$$\hat{p}_i = \bar{p}_i + O(n^{-1}). \tag{4E.20}$$

Comments: Since $\delta = O(n^{-1})$, for large enough sample sizes, the $\rho_0$ neighborhood can be closely approximated by the $\rho$ neighborhood. In turn, we can determine whether the iterates can be expected to be within the epm convergence region bounded by $\rho$, where condition (4E.8) must hold, by checking, first, whether the following inequality

$$\max_i \sum_{j=1}^{k} |\partial g_i(p^{(s)})/\partial p_j| = \max_i \sum_{D \ni i} z_D/m \{ 1/p_D^{(s)} + [\beta(D)-2]p_i^{(s)}/[p_D^{(s)}]^2 \} < 1 \tag{4E.21}$$

holds for every iterate $p^{(s)}$, $q \le s \le t$, for t+1 the number of iterations required for the convergence condition to be met and q the number of the first iteration that begins an unbroken succession of iterations satisfying (4E.21). If (4E.21) does not successively hold after some number of iterations, then different initial estimates can be tried and inequality (4E.21) reevaluated.

Second, if (4E.21) holds for sets of iterates converging to different
values [i.e., to different roots of (4E.4)], more than one of which is in
$P_k$, we must determine which root, if any, is in the epm convergence
region. [See Section 4D.5 for two examples of multiple roots, one having
three roots and the other having five roots, for the asymptotic posterior
mean for incomplete trinomial data.] As discussed in the main text,
Section 4.3.2, the global maximum within $P_k$ is conjectured to be the
root that is in the epm convergence region or at least closest to the
exact posterior mean. Hence, of those iteration sequences satisfying
(4E.21) and converging to different roots in $P_k$, we choose that one
for which the likelihood function

    $p_1^{v_1-1}\,p_2^{v_2-1} \cdots p_{k+1}^{v_{k+1}-1} \prod_D p_D^{z_D}$

is a maximum.

Note that the conditions on the partial derivatives and initial
iterative estimate are sufficient but not necessary. Finally, we give
three examples that show that Lemma 4E.1 gives very conservative bounds
on the error $\|\tilde{p}-\bar{p}\|_\infty$ and on the guaranteed-convergence
neighborhood of the exact posterior mean $\bar{p}$. For these examples we
use the data z=(2,5,6,4,2,0) given in Section 2.2.3, where we calculated
the exact posterior mean as $\bar{p}$=(.2412, .3849, .3739).

For the first example, consider the neighborhood $\|p-\bar{p}\|_\infty < \rho = .11$
of $\bar{p}$. For all probabilities p in this $\rho$ neighborhood,
$\max_i \sum_j |\partial g_i(p)/\partial p_j| = (4/22)/p_{12} + (2/22)/p_{13} < .56 < 1$,
and a bound on the error made by approximating the exact posterior mean
by a Taylor-series expansion is $\delta$=0.035. Thus, $\delta/(1-\lambda)$=0.080.
Suppose that we choose an initial iterative estimate $p^{(0)}$ in the
region bounded by $\rho_0 = \rho - \delta/(1-\lambda)$ = .11-.08 = .03. Then the
iteration process is guaranteed to converge to within $\delta/(1-\lambda)$=.08
of the exact posterior mean. However, for any initial iterative
probability (including that one whose three components each differ from
the three corresponding components of $\bar{p}$ by .11) chosen within this
$\rho$ neighborhood, the maximum difference between the converged
iterative estimate and the exact posterior mean was 0.003, more than 25
times smaller than the $\delta/(1-\lambda)$=.080 error bound given by
Theorem 4E.1.

Now consider as an initial iterative estimate for
$\bar{p}$=(.2412, .3849, .3739) the value $p^{(0)}$=(.05, .10, .85). For
this value,

    $\max_i \sum_{j=1}^{2} |\partial g_i(p)/\partial p_j| = (4/22)/.15 + (2/22)/.90 = 1.21 + .10 = 1.31 > 1$.

Hence, the conditions of Lemma 4E.1 are not satisfied. However, use of
this initial iterative estimate gives successive iterates, as shown in
the following Table 4E.1, that do converge to within a small error of
$\bar{p}$.

The initial iterative estimate failed condition (4E.8) because
$p_{12}^{(0)}$ = .15 was smaller than $z_{12}/m = (n_{12}/m)\,\bar{p}_{12}$. Note from
this example that small values of $p_D$ will be particularly troublesome
in keeping the term $(z_D/m)/p_D = (n_D/m)\,\bar{p}_D/p_D$ less than 1.

In this example $\|p^{(0)}-\bar{p}\|_\infty$ = max(.19, .28, .48) = .48. Thus, the
largest value of $\rho$ for a guaranteed-convergence neighborhood of
$\bar{p}$ must be smaller than .48. In the next example we choose as an
initial iterative estimate a probability $p^{(0)}$ = (.90, .07, .03) that
is even further away from $\bar{p}$. For this estimate,
$\|p^{(0)}-\bar{p}\|_\infty$ = max(.66, .31, .34) = .66. [See also Figure 4E.2.]
Since .66 > .48 of the last example, this initial iterative estimate
cannot be in a guaranteed-convergence neighborhood of $\bar{p}$. Yet, for
this estimate,

    $\max_i \sum_{j=1}^{2} |\partial g_i(p)/\partial p_j| = (4/22)/.97 + (2/22)/.93 = .29 < 1$.

Further, as shown in Table 4E.1, the sequence of iterates arising from
this initial iterative estimate also converges to within a small error
of the exact posterior mean.
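
The arithmetic behind the bound in the second and third examples can be checked numerically. The following Python lines are only an illustrative sketch: the function name `row_sum_bound` is hypothetical, and the divisor m = 22 is an assumption read off the fractions 4/22 and 2/22 used in the examples.

```python
def row_sum_bound(p, z, m):
    """Return max over i of the sum, over pairs D containing i, of (z_D / m) / p_D.

    p is a trinomial probability (p1, p2, p3);
    z is the incomplete data (z1, z2, z3, z12, z13, z23).
    """
    # Partially classified counts, keyed by the two categories left unresolved.
    pairs = {(0, 1): z[3], (0, 2): z[4], (1, 2): z[5]}
    sums = []
    for i in range(3):
        total = 0.0
        for (a, b), z_d in pairs.items():
            if i in (a, b):
                total += (z_d / m) / (p[a] + p[b])
        sums.append(total)
    return max(sums)


z = (2, 5, 6, 4, 2, 0)
m = 22  # assumption: divisor taken from the fractions 4/22 and 2/22 in the text

second = row_sum_bound((0.05, 0.10, 0.85), z, m)  # second example
third = row_sum_bound((0.90, 0.07, 0.03), z, m)   # third example
print(round(second, 2), round(third, 2))
```

In both cases the maximum is attained at i = 1, because category 1 is the only one contained in both nonempty pair categories.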




TABLE 4E.1

CONVERGENCE EXAMPLES FOR OUTSIDE INITIAL ESTIMATES (1)

              SECOND EXAMPLE            THIRD EXAMPLE
   s (2)    (p_1)_s    (p_2)_s       (p_1)_s    (p_2)_s

     0       .0500      .1000         .9000      .0700
     1       .2020      .3939         .3930      .2858
     2       .2283      .3929         .2917      .3493
     3       .2374      .3877         .2598      .3718
     4       .2391      .3870         .2488      .3798
     5       .2413      .3851         .2448      .3826
     6       .2421      .3845         .2433      .3836
     7       .2420      .3846         .2428      .3839

(1) Initial iterative estimates chosen outside the
    guaranteed-convergence sphere of Lemma 4E.1
    for the exact posterior mean p=(.2412, .3849, .3739).

(2) Iteration number



CHAPTER 5 


SMALL-SAMPLE STUDIES OF APPROXIMATIONS FOR POSTERIOR MOMENTS 
AND OF ESTIMATORS FOR MINIMIZING QUADRATIC LOSS 

5.1 Introduction : 

In the last chapter, we showed that for large sample sizes the 
Taylor-series approximations should be very close to corresponding exact 
posterior moments. We now consider how well these asymptotic properties 
hold in small- and medium-size samples. We also compare the Taylor- 
series approximations with the posterior mode and maximum-likelihood 
estimate to determine which best approximates the exact posterior mean 
for these smaller sample sizes. Although all three approximations will 
be very close for very large sample sizes, we expect differences in the 
most commonly encountered sample sizes. 

We then turn to our main interest and report which of these three
estimators best minimizes expected quadratic loss (risk) for specified
values of the Dirichlet probabilities. Note that we do not include the
exact posterior mean in the risk study. Results from the approximation 
part of this small-sample study showed that there was no difference be- 
tween the Taylor-series approximation and the exact posterior mean that 
would alter conclusions from using the Taylor-series approximation for 
the exact posterior mean. Since the exact posterior mean becomes in- 
creasingly expensive as the sample size and/or percentage of incomplete 
data increases, we used the Taylor-series approximation for the exact 
posterior mean. Therefore, in our mnemonics, we refer to the Taylor-
series approximation as APM (approximate posterior mean).

In Figure 5.0 we give expressions for the three approximations for
the exact posterior mean and estimators for minimizing quadratic loss.
These equations were presented in Chapter 1 or derived in Chapter 3.

The mnemonics APM, PMD (posterior mode), and MLE (maximum likelihood
estimate) in parentheses are used throughout these next three chapters.
They are especially useful in presenting results in the next two chapters.
For the risk study, we attach suffixes R0, R1, and R2 to these mnemonics
to denote the three robustness studies for use of the original, uniform,
and perturbed priors, respectively, in the Bayesian estimators.

In summary, we are interested in four main questions: (1) how 

well the Taylor-series expansions approximate the exact posterior mean 
and covariance matrices; (2) which of three estimators (Taylor-series, 
posterior mode, and maximum likelihood estimate) best approximates the 
exact posterior mean; (3) which of these three estimators best minimi- 
zes risk; and (4) how robust results from (3) are to use of the wrong 
prior in the Bayesian estimators. Because we were unable to solve these 
problems theoretically, we used Monte-Carlo simulation studies. Hence, 
results will be only indicative, not conclusive. 

In this chapter, we discuss designs and computational procedures
for two Monte-Carlo studies. In the next two chapters, we discuss
results from these studies.

FIGURE 5.0

APPROXIMATIONS FOR EXACT POSTERIOR MEAN
AND ESTIMATORS FOR QUADRATIC LOSS

Taylor-Series (APM)

    $\tilde{p}_i = [z_{\{i\}} + v_i + \sum_{D \ni i} (\tilde{p}_i/\tilde{p}_D)\,z_D] / [n + \sum_{j=1}^{k+1} v_j]$

Posterior Mode (PMD)

    $\check{p}_i = [z_{\{i\}} + v_i - 1 + \sum_{D \ni i} (\check{p}_i/\check{p}_D)\,z_D] / [n + \sum_{j=1}^{k+1} v_j - (k+1)]$

Maximum Likelihood (MLE)

    $\hat{p}_i = [z_{\{i\}} + \sum_{D \ni i} (\hat{p}_i/\hat{p}_D)\,z_D] / n$

Note that k=2 for the trinomial simulation study. Also note that the
braces in $z_{\{i\}}$ are henceforth dropped.
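
Each expression in Figure 5.0 defines the estimator implicitly, so it is evaluated by repeated substitution. The following Python sketch illustrates one such update for trinomial data; it is not the author's Fortran program. The names `fixed_point_step` and `iterate` are hypothetical, and the prior v = (1, 1, 1) is an assumption chosen because it gives the divisor 22 that appears in the Appendix 4E examples for z = (2, 5, 6, 4, 2, 0).

```python
def fixed_point_step(p, z, v, kind="apm"):
    """One substitution step of a Figure 5.0 equation for trinomial data.

    p: current iterate (p1, p2, p3); z: (z1, z2, z3, z12, z13, z23);
    v: Dirichlet prior parameter; kind: "apm", "pmd", or "mle".
    """
    singles = z[:3]
    pairs = {(0, 1): z[3], (0, 2): z[4], (1, 2): z[5]}
    n = sum(z)
    k_plus_1 = 3
    new_p = []
    for i in range(3):
        # Share of each partially classified count allotted to category i.
        alloc = sum(z_d * p[i] / (p[a] + p[b])
                    for (a, b), z_d in pairs.items() if i in (a, b))
        if kind == "apm":
            new_p.append((singles[i] + v[i] + alloc) / (n + sum(v)))
        elif kind == "pmd":
            new_p.append((singles[i] + v[i] - 1 + alloc) / (n + sum(v) - k_plus_1))
        elif kind == "mle":
            new_p.append((singles[i] + alloc) / n)
        else:
            raise ValueError("kind must be 'apm', 'pmd', or 'mle'")
    return tuple(new_p)


def iterate(p0, z, v, kind="apm", steps=200):
    """Apply fixed_point_step a fixed number of times (no convergence test)."""
    p = tuple(p0)
    for _ in range(steps):
        p = fixed_point_step(p, z, v, kind)
    return p


z = (2, 5, 6, 4, 2, 0)
v = (1, 1, 1)  # assumption: uniform prior, consistent with the divisor 22
first_a = fixed_point_step((0.05, 0.10, 0.85), z, v)
first_b = fixed_point_step((0.90, 0.07, 0.03), z, v)
limit_a = iterate((0.05, 0.10, 0.85), z, v)
limit_b = iterate((0.90, 0.07, 0.03), z, v)
```

Under these assumptions the first APM step from each starting value agrees with the iteration-1 row of Table 4E.1, and both starting values lead to the same limit. With v = (1, 1, 1) the PMD and MLE updates coincide, since the terms v_i - 1 and the sum of v_j minus (k+1) both vanish.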
5.2 Special Notation and Mnemonics :

Notation:

$p_i$           i-th element of Dirichlet generator probability vector

$\bar{p}_i$     i-th element of exact posterior mean

$\tilde{p}_i$   i-th element of Taylor-series (T.S.) approximate posterior
                mean (APM)

$\hat{p}_i$     i-th element of maximum likelihood estimate (MLE)

$\check{p}_i$   i-th element of posterior mode (PMD)

$\hat{p}_{xi}$  i-th element of complete-data maximum likelihood estimate
                (used mainly for variance reduction in estimating mean
                squared error)

$p^*_i$         i-th element of dummy estimator $p^*$, which is used when
                describing properties or formulas that pertain to more
                than one of the above estimators

$\bar{e}_i$     $p^*_i - \bar{p}_i$ for $p^*_i$ any of above estimators
                $\tilde{p}_i$, $\check{p}_i$, and $\hat{p}_i$

$e_i$           $p^*_i - p_i$ for $p^*_i$ any of above estimators
                $\tilde{p}_i$, $\check{p}_i$, and $\hat{p}_i$

Note that we are using p to denote both any value of the simplex
$P_2 = \{(p_1,p_2,p_3):\ 0 \le p_1,p_2,p_3 \le 1;\ p_1+p_2+p_3 = 1\}$ and a
particular value of $P_2$. The context in which p is used should make
clear the particular meaning. Further, note that both p and $\bar{p}$ are
Dirichlet probabilities. The p either is set to the expected value of
the Dirichlet distribution of p given v (note Design 1 in following
Section 5.4) or is generated from this distribution (Design 2). In
either case, we shall refer to p as the generator. The $\bar{p}$ refers to
the posterior mean of the Dirichlet distribution of p given the
incomplete data z. Thus, for Design 1, p is the prior Dirichlet mean and
$\bar{p}$, the posterior Dirichlet mean.

Mnemonics: Note that the following mnemonics might appear in lower-case,
as well as capital, letters:

    APC    Taylor-series approximate posterior covariance
    APM    Taylor-series approximate posterior mean
    EPC    exact posterior covariance
    EPM    exact posterior mean
    MLE    maximum likelihood estimate
    MSE    mean squared error
    PID    percentage of incomplete data
    PMD    posterior mode
    SS     sample size

5.3 Criteria of Goodness :

To determine how good an estimator was either for estimating an
exact posterior moment or for minimizing quadratic loss, we used several
criteria. The main criterion for judging the accuracy of approximations
for the exact posterior moments was percent relative difference. To
judge among the estimators for estimating the exact posterior mean, we
also used mean squared error $E[(p^*-\bar{p})'(p^*-\bar{p})]$, where $p^*$ denotes
the estimator and $\bar{p}$ the exact posterior mean. Of course, for judging
which estimator best minimized quadratic loss, the criterion was the
mean squared error $E[(p^*-p)'(p^*-p)]$, where p is the generator
probability. Estimates of mean squared error (mse) are discussed in
Section 5.9.

Additional measures of goodness were also considered in Chapter 6
where we studied the estimators in detail. For example, among addi-
tional calculations were the frequency distributions of the number of
iterations, deviations, and percentage relative difference. Criteria
of goodness are included in the listing of tables in Chapters 6 and 7.

5.4 Computer :

Computers used for the simulation were a CDC (Control Data Corpora-
tion) 6600 and Cyber 175 with 60-bit words. Single-precision calcula-
tions were accurate to about 14.5 significant figures; double-precision
calculations, to 29. The programing language was Fortran Extended,
Version 4.6. To minimize execution cost, recommendations from the NASA
Langley Research Center "Computer Programing Manual" (1975, v. 1,
sect. 8) were incorporated.

Main incorporations were the passing of parameters among programs
through COMMON rather than calling sequences and a reduction in a num-
ber of otherwise large DO-LOOP indices. Owing to the latter, program
statements and number of variables increased. Number of dimensions on
a variable decreased. Among other inclusions were use of
"IF (A-B) 10,20,20" instead of "IF (A.GE.B) 20,10", collapsed
dimensioning for array initializations, and special procedures for
arithmetic operations.

Unless otherwise noted, all programs were written by the author.
A listing of most of these programs is given in Credeur (1978). An
index precedes the listing.

5.5 Factors in Experiments :

In investigating the four issues outlined in the Introduction, we 
were interested in the effects of variation in prior parameter v, the 
Dirichlet probabilities arising from the distribution of p given v, 
sample size (SS), and percentage of incomplete data (PID). 

Owing to cost constraints, we limited the number of these varia- 
tions. For percentage of incomplete data (PID) we chose 15 and 40. We 
already knew from Chapter 3 that for 0 % incomplete data, the Taylor- 
series approximations (APM and APC) exactly equaled the posterior mean 
and posterior covariance, respectively, whereas the posterior-mode 
(PMD) and maximum-likelihood-estimate (MLE) approximations did not. 

Thus, for investigating the first two introductory questions concerning 
estimators for the exact posterior mean and covariance (EPM and EPC) 
matrices, we essentially had PID for values 0, 15, and 40. 

For sample size (SS) we chose 25 and 50. For these values and 
ranges of PID we were able to calculate the exact posterior mean and 
covariance matrices. As noted in Chapter 2, for sample sizes much 
larger than 50, calculations for the exact values would be expensive, 
especially for those cases in which PID=40. 

To set values of the prior parameter v, we first considered values
we wanted for the Dirichlet probabilities arising from the distribution
of p given the prior. We wanted roughly to cover the range of
probabilities from (0,0,1) to (1/3,1/3,1/3). We picked four values,
(.01,.01,.98), (.10,.10,.80), (.20,.30,.50), and (1/3,1/3,1/3), as focal
points to be investigated. Now, usually one has a prior because one has
a prior sample. If the size of the prior sample is small relative to the
size of the current sample, then the prior has little effect on the
estimators. If the prior-sample size is relatively large, the current
data has little effect. Therefore, because we chose current sample sizes
of 25 and 50, we set the size of the prior data at 10, two-fifths and
one-fifth the current information, respectively. Thus, values of the
prior parameter v were chosen as 10 times the prior mean we wanted.
That is, since $E(p_i|v) = v_i/\sum_{j=1}^{3} v_j$ and $\sum_{j=1}^{3} v_j = 10$, then

    $v_i = 10 \times E(p_i|v)$.    (5.1)

Setting E(p|v) to the four focal points gave values of v as (.1,.1,9.8),
(1,1,8), (2,3,5), and (10/3,10/3,10/3).
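
Equation (5.1) can be checked with exact rational arithmetic. This is a minimal Python sketch; the function name `prior_parameter` and its `prior_size` argument are hypothetical labels for the quantities described above.

```python
from fractions import Fraction


def prior_parameter(prior_mean, prior_size=10):
    """Scale a desired prior mean E(p|v) by the prior sample size, as in (5.1)."""
    return tuple(prior_size * Fraction(m) for m in prior_mean)


focal_points = [
    ("0.01", "0.01", "0.98"),
    ("0.10", "0.10", "0.80"),
    ("0.20", "0.30", "0.50"),
    (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)),
]
priors = [prior_parameter(point) for point in focal_points]
for v in priors:
    print(tuple(str(component) for component in v), "sum =", sum(v))
```

Every resulting v sums to 10, the assumed size of the prior sample.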

The simulation study was done in two stages, as follows. In the 
first stage, which we called Design 1, we fixed the value of the 
Dirichlet probability at the expected value of the distribution of p 
given each one of the four prior parameters v. In the second, Design 2, 
we generated 10 values of the Dirichlet probability from each of the 
fixed values of v. Designs 1 and 2 are illustrated in Figures 5.1 and 
5.2, respectively. A summary design is given as Figure 5.3. 

Results from Design 1 allowed at least some of the four introductory
questions, especially those concerning the exact-posterior-moments com-
parisons, to be satisfactorily answered. Because cost was less, more
details were studied. The second design, Design 2, allowed us to
determine how Design 1 results were affected by our choosing a special
probability, the expected value of p given v. As we moved away from the
expected value in Design 2, using probabilities randomly generated
from fixed v, how did Design 1 results change?

FIGURE 5.1

DESIGN 1

    LEVEL B (+)  Dirichlet p variation (*); 4 levels:
                   p_1 = (.01, .01, .98)    [v = (0.1, 0.1, 9.8)]
                   p_2 = (.10, .10, .80)    [v = (1.0, 1.0, 8.0)]
                   p_3 = (.20, .30, .50)    [v = (2.0, 3.0, 5.0)]
                   p_4 = (1/3, 1/3, 1/3)    [v = (10/3, 10/3, 10/3)]
    LEVEL C      % incomplete-data variation; 2 levels
    LEVEL D      sample-size variation; 2 levels
    LEVEL E      trinomial-data generation (x,z), (x,z), ..., (x,z);
                   2 replications, 200 trials per replication
                   x = complete data
                   z = incomplete data

    (+) Level A is not present in this design (see Design 2 and Summary
        Design).
    (*) p is the expected value of the Dirichlet probability distribution
        given v.

    This design yields 6400 [= 4 x 2 x 2 x 200 x 2 repl.] data sets and
    requires generation of 240,000 [= 4 x 2 x (25+50) x 200 x 2 repl.]
    uniform random numbers.

    This design constitutes sets of full factorials:
      a. for epm comparisons: 4 x 2^2 x 3 with 2 replications per cell
         (last factor level 3 refers to estimators apm, pmd, and mle)
      b. for quadratic-loss comparisons: 4 x 2^2 x 3 with two replications
         per cell (last factor level 3 refers to estimators apm, pmd,
         and mle)

FIGURE 5.2

DESIGN 2

    LEVEL A      prior-parameter variation; 4 levels:
                   v = (0.1, 0.1, 9.8), v = (1.0, 1.0, 8.0),
                   v = (2.0, 3.0, 5.0), v = (10/3, 10/3, 10/3)
    LEVEL B      Dirichlet p generation (*); 10 levels for each v:
                   p_1, ..., p_10
    LEVEL C      % incomplete-data variation; 2 levels
    LEVEL D      sample-size variation; 2 levels
    LEVEL E      trinomial-data generation (x,z), ..., (x,z);
                   2 replications, 200 trials per replication
                   x = complete data
                   z = incomplete data

    (*) Each Dirichlet p requires generation of 3 gamma random variables.

    This design yields 64,000 [= 4 x 10 x 2 x 2 x 200 x 2 repl.] data sets
    and requires generation of 2,400,120 random numbers [i.e.,
    120 = 4 x 3(*) x 10 gamma random variables for 40 3-dimensional
    Dirichlet random variables + 2,400,000 = 4 x 10 x 2 x (25+50) x 200
    x 2 repl. uniform random numbers].

    This design constitutes sets of nested factorials:
      a. for epm comparisons: 4 x 10 x 2^2 x 3 with two replications per
         cell
      b. for quadratic loss: 4 x 10 x 2^2 x 3 with two replications per
         cell

FIGURE 5.3

SUMMARY DESIGN

    LEVEL A      prior-parameter variation; 4 levels
    LEVEL B      Dirichlet p generation (*); 10+1 levels
    LEVEL C      % incomplete-data variation; 2 levels
    LEVEL D      sample-size variation; 2 levels
    LEVEL E      trinomial-data generation (x,z), (x,z), ..., (x,z);
                   2 replications, 200 trials per replication
                   x = complete data
                   z = incomplete data

    (*) Each Dirichlet p requires generation of 3 gamma random variables.

    Total designs yield 70,400 [= 4 x 11 x 2 x 2 x 200 x 2 replic.] data
    sets and require generation of 2,640,120 random numbers [i.e.,
    120 = 4 x 3(*) x 10 gamma random variables for 40 3-dimensional
    Dirichlets + 2,640,000 = 4 x 11 x 2 x (25+50) x 200 x 2 replic.
    uniform random numbers].
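
The counts of data sets and random numbers quoted for the three designs follow directly from the factor levels. The short Python sketch below reproduces them; the function name `design_counts` and its argument names are hypothetical, and it assumes, as the bracketed products in the figures do, that each trinomial observation consumes one uniform random number.

```python
def design_counts(n_priors, n_probabilities, pid_levels=2,
                  sample_sizes=(25, 50), trials=200, replications=2):
    """Return (number of data sets, number of uniform random numbers)."""
    cells = n_priors * n_probabilities * pid_levels
    data_sets = cells * len(sample_sizes) * trials * replications
    # One uniform number per observation, summed over both sample sizes.
    uniforms = cells * sum(sample_sizes) * trials * replications
    return data_sets, uniforms


design_1 = design_counts(4, 1)    # one probability (the prior mean) per prior
design_2 = design_counts(4, 10)   # ten generated probabilities per prior
summary = design_counts(4, 11)    # both designs together
gammas = 4 * 3 * 10               # three gamma variables per generated Dirichlet p
print(design_1, design_2, summary, gammas)
```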

To measure the variation in the probabilities p associated with a
prior v, define a centrality norm

    $C(p) = \sum_{i=1}^{2} \sum_{j>i}^{3} (p_i - p_j)^2$.    (5.2)

Values of C(p) for Design 1 are given in the following Table 5.1. Note
from Table 5.1 that as p moves from a corner of $P_2$ toward its center,
C(p) decreases from 2.00 to 0. Centrality measures for generated
Dirichlet probabilities in Design 2 are given in Table 7.1 in Chapter 7.


TABLE 5.1

CENTRALITY MEASURES FOR DESIGN 1

    v                     E(p|v)               C(p)

    (0.1, 0.1, 9.8)       (.01, .01, .98)      1.88
    (1.0, 1.0, 8.0)       (.10, .10, .80)       .98
    (2.0, 3.0, 5.0)       (.20, .30, .50)       .14
    (10/3, 10/3, 10/3)    (1/3, 1/3, 1/3)       .00
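
The centrality values can be recomputed from definition (5.2). This is a small Python sketch; the function name `centrality` is a hypothetical label.

```python
from itertools import combinations


def centrality(p):
    """Sum of squared differences over all pairs of components, as in (5.2)."""
    return sum((a - b) ** 2 for a, b in combinations(p, 2))


design_1_means = [
    (0.01, 0.01, 0.98),
    (0.10, 0.10, 0.80),
    (0.20, 0.30, 0.50),
    (1 / 3, 1 / 3, 1 / 3),
]
values = [round(centrality(p), 2) for p in design_1_means]
corner = centrality((0.0, 0.0, 1.0))
print(values, corner)
```

The corner (0, 0, 1) gives the maximum value 2, and the center of the simplex gives 0.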


Factors SS and PID were quantitative; we considered v and p to be
qualitative. In Design 1 all factors were fixed. In Design 2, p was
random and remaining factors were fixed.

Once we fixed the factor levels, we generated the trinomial data.
In the next section, we discuss how we chose the number of trinomial
simulations and, in Section 5.7, how we generated the data. To allow a
control variate and thus a better mean-squared-error (mse) estimate
for the risk study, we generated complete as well as incomplete data.

For the exploratory robustness study, the two priors used besides
the original v prior were the uniform prior and a perturbed prior.
Values of both are given in the following Table 5.2. The uniform prior
is frequently used when one is uncertain of previous information. It
gives equal weight to all three trinomial categories. The perturbed
prior not only differs in magnitude from the correct prior v but does
so in a skewed manner. The change to the first component is
approximately twice that to the second component and two-thirds that
to the third component. The first two components increase; the third
decreases.

TABLE 5.2

PRIORS FOR ROBUSTNESS STUDY

    Robustness Set    Type         Value

    R0                original     v
    R1                uniform      (1,1,1)
    R2                perturbed    10 x [v/10 + (.09, .05, -.14)]
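
As a worked illustration of Table 5.2, the following Python sketch evaluates the three priors for a given original prior v. The function name `robustness_priors` is hypothetical, and the constant 10 is the prior sample size used throughout this chapter.

```python
PERTURBATION = (0.09, 0.05, -0.14)


def robustness_priors(v):
    """Return the R0 (original), R1 (uniform), and R2 (perturbed) priors for v."""
    perturbed = tuple(10 * (v_i / 10 + d) for v_i, d in zip(v, PERTURBATION))
    return {"R0": tuple(v), "R1": (1.0, 1.0, 1.0), "R2": perturbed}


priors = robustness_priors((2.0, 3.0, 5.0))
print(priors["R2"])
```

Because the three perturbations sum to zero, the perturbed prior keeps the same total, 10, as the original prior.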
5.6 Determination of Number of Simulation Trials and Replications :

Because mean squared error was the major overall "goodness" measure, 
the main criterion for choosing the number of trinomial simulation trials 
was that the standard errors of the average estimated mean squared errors 
be small relative to the difference between the mean squared errors. For 
this purpose, 200 trials was enough; for just comparisons among approx- 
imations for the exact posterior mean, fewer trials would have been 
needed. 

As noted in Section 5.3, we were also interested in the deviations
of the estimators from the exact posterior mean (the "EPM deviations").
One deviation, or error, measure was the average. However, the number
of simulation trials needed to make the standard errors of the average
deviations small relative to the difference between the average devia-
tions was prohibitively expensive. Results of Design 1 showed that the
average APM deviation was a couple of orders of magnitude smaller than
the average PMD and MLE deviations. Hence, the difference between it and
either of the average PMD or MLE errors approximately equaled the PMD
or MLE deviation, respectively. However, even for a number of trials
as large as 200, the standard error of the average deviation roughly
equaled the respective average deviation. (All EPM deviations averaged
zero to varying numbers of decimal places.) Therefore, the
APM-PMD $\approx$ PMD and APM-MLE $\approx$ MLE differences were not always larger
than the standard errors of the PMD and MLE average deviations. (They
were, however, much larger than the standard errors for the average APM
deviation.)

To estimate the experimental error in estimating averages,
including the mean squared errors, we repeated each set of 200 trinomial
simulation trials once. [Recall Figures 5.1 - 5.3.] Cost considera-
tions precluded more than two replications. Although each of the 200
trinomial simulations can be called a replication, for differentiation,
we reserve this term for these two repetitions. The two replications
also provided another check that 200 trinomial simulations were enough.
There was little difference between results for each of the two
replications.

To determine the number of simulation trials to use in generating
Dirichlet probabilities in Design 2, we were guided mainly by cost
constraints. We took only 10 trials. Results of Chapter 7 show how
surprisingly good 10 trials were in terms of theoretical expectations.

5.7 Data Generation :

For Design 1 we must generate complete and incomplete trinomial 
data. For Design 2 we must also generate Dirichlet probabilities. 

5.7.1 Uniform Random-Number Generator:

As do most other generator algorithms, algorithms to generate
trinomial and Dirichlet data depend on a uniform random-number generator.
For this generator, we used the multiplicative congruential generator

    $x_i = 43490275647445\,x_{i-1} \bmod 2^{48}$    (5.3)

from Ahrens and Dieter (1974, p. 223). Uniformly distributed variables
$u_i$ were then calculated by

    $u_i = x_i/2^{48}$.    (5.4)

The multiplier 43490275647445 is congruent to 5 mod 8; therefore,
from Knuth (1969, pp. 18, 93), the generator (5.3) has maximum period of
$2^{46}$ and we can apply the Spectral test of Coveyou and Macpherson
(1967). The Spectral test is currently the most powerful test of the
randomness of a random-number generator. By using a computer program
written by Golder (1976, p. 173) with corrections by Hoaglin and King
(1978), we calculated the Spectral Numbers $C_i$, $2 \le i \le 5$, as

    $C_2 = 2.839,\ C_3 = 2.095,\ C_4 = 1.819,\ C_5 = 0.987$.    (5.5)

Since $C_2$, $C_3$, and $C_4$ all exceed 1 and $C_5$ is almost 1, the generator
is very good in terms of the Spectral test, a theoretical test.
Therefore, it is most likely good in terms of any empirical tests.

As an empirical check on the generator, however, we ran a number
of 95% confidence-interval tests on the sample means and standard
deviations, chi-square tests, serial-correlation tests, and Kolmogorov-
Smirnov tests on the cumulative frequency, and plotted the density
and the cumulative distribution. The generator did well on all.
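
Generator (5.3)-(5.4) is simple to state in code. The following is a minimal Python sketch, not the original Fortran routine; the function name `lcg_uniforms` is hypothetical, and the requirement of an odd seed is the usual condition for a multiplicative generator with a power-of-two modulus to attain its maximum period.

```python
MULTIPLIER = 43490275647445
MODULUS = 2 ** 48


def lcg_uniforms(seed, count):
    """Return `count` uniform numbers from x_i = MULTIPLIER * x_(i-1) mod 2**48."""
    if seed % 2 == 0:
        raise ValueError("seed should be odd for the maximum period")
    x = seed
    uniforms = []
    for _ in range(count):
        x = (MULTIPLIER * x) % MODULUS
        uniforms.append(x / MODULUS)
    return uniforms


sample = lcg_uniforms(seed=1, count=5)
print(MULTIPLIER % 8, sample)
```

Python integers have arbitrary precision, so the product is formed exactly before the reduction mod 2^48; on a fixed-word machine the same reduction needs more care.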

5.7.2 Dirichlet Random-Number Generator:

To generate a Dirichlet random vector, we used the following
theorem from Wilks (1963, p. 179):

    "If $x_1, \ldots, x_{k+1}$ are independent random variables having gamma
    distributions $G(v_1), \ldots, G(v_{k+1})$, then for

        $y_i = x_i/(x_1 + \cdots + x_{k+1})$,  $1 \le i \le k$,    (5.6)

    $(y_1, \ldots, y_k)$ has the k-variate Dirichlet distribution
    $D(v_1, \ldots, v_k; v_{k+1})$."

Therefore, to obtain one random vector p from a Dirichlet distribution
with k=2, we must generate three independent gamma random variables. To
do so, we used algorithm GT from Ahrens and Dieter (1974, p. 229).

We checked the Ahrens-Dieter GT algorithm by doing 95% confidence
limits on the sample means and standard deviations, plots on the density
and cumulative frequency, and Kolmogorov-Smirnov tests on the cumulative
distribution. We then performed these same tests on the Dirichlet
probabilities p calculated from these gammas.
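
The construction in (5.6) can be sketched in a few lines of Python. This illustration substitutes the standard library's `random.gammavariate` for the Ahrens-Dieter GT algorithm used in the study, and the function name `dirichlet_draw` is hypothetical.

```python
import random


def dirichlet_draw(v, rng):
    """Draw one Dirichlet vector by normalizing independent gamma variables."""
    # Unit-scale gamma variables with shape parameters v_1, ..., v_(k+1).
    gammas = [rng.gammavariate(shape, 1.0) for shape in v]
    total = sum(gammas)
    return tuple(g / total for g in gammas)


rng = random.Random(12345)
v = (2.0, 3.0, 5.0)
draws = [dirichlet_draw(v, rng) for _ in range(20000)]
means = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
print([round(m, 3) for m in means])
```

With enough draws the component means should lie close to v_i divided by the sum of the v_j, here (.2, .3, .5).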

Other than the standard deviations, the generators performed well.
As shown in the following Table 5.3, for the gamma random variables, the
standard deviations routinely exceeded the 95% confidence limits more
than 5% of the time and became increasingly worse as we moved from 10/3
in the center of the 2-dimensional simplex $P_2$ to 0.1 or 9.8 at a corner.
The same trend was observed for standard deviations of the marginal
Dirichlets except that the percentage of rejections was much smaller and
the standard deviations were good for points away from the boundary.

TABLE 5.3

PERCENT REJECTIONS IN 95% NORMAL (1) CONFIDENCE INTERVALS FOR GAMMA AND
MARGINAL DIRICHLET RANDOM VARIABLES
(300 TRIALS, 200 OBSERVATIONS PER TRIAL, EXPECTED PERCENTAGE IS 5)

    PRIOR PARAMETER     SEED   GENERATOR   M1      M2   M3   SD1  SD2      SD3

    (0.1, 3.5, 6.4)     21153  Gamma       5% (2)  3%   4%   77%  11% (4)  9%
                               Dirichlet   5 (3)   4    4    69   3        3
    (0.1, 0.1, 9.8)     21197  Gamma       5       4    4    71   70       9
                               Dirichlet   5       5    5    68   69       45
    (0.1, 0.1, 9.8)     21153  Gamma       4       6    5    73   72       9
                               Dirichlet   3       5    5    68   68       56
    (0.5, 0.5, 9.0)     21143  Gamma       4       4    4    44   44       10
                               Dirichlet   5       5    5    34   31       17
    (1.0, 1.0, 8.0)     31153  Gamma       2       5    4    26   35       13
                               Dirichlet   3       4    4    19   20       9
    (10/3, 10/3, 10/3)  21153  Gamma       5       5    4    14   16       17
                               Dirichlet   2       5    3    3    4        5
    (2.5, 3.0, 4.5)     22213  Gamma       3       4    3    20   17       15
                               Dirichlet   4       7    6    3    3        3
                        21113  Gamma       6       6    2    17   19       13
                               Dirichlet   5       4    3    6    6        5
                        21111  Gamma       5       6    4    18   21       10
                               Dirichlet   5       5    5    6    5        2
                        21313  Gamma       3       6    3    14   16       13
                               Dirichlet   3       2    2    3    4        1
                        21153  Gamma       2       4    4    15   13       11
                               Dirichlet   5       3    5    4    2        3

(1) Normal approximation is used for the confidence intervals.

(2) In 300 trials (sets of generations), 200 observations per trial, from
    gamma(0.1), the sample mean M1 (calculated over 200 observations)
    fell outside the 95% normal confidence interval 5% of the time
    (approximately 15 of the 300 trials).

(3) In 300 trials, 200 observations per trial, from
    beta(0.1, 3.5+6.4) = beta(0.1, 9.9), the sample mean M1 fell outside
    the 95% normal confidence intervals 5% of the time.

(4) In 300 trials, 200 observations per trial, from gamma(3.5), the
    sample standard deviation (calculated over 200 observations) fell
    outside the 95% normal confidence intervals 11% of the time.

This behavior may be due either to (1) a poor fit of the generated data
to the theoretical curve or (2) to the normal approximation, which we
used, for the confidence intervals for the standard deviations being
poor for the sample sizes we used.

For three reasons, we accepted the latter explanation. The first
reason is that, as noted, for probabilities away from the boundary,
marginals from those Dirichlet random vectors generated from these gammas
did have standard deviations falling in the 95% confidence intervals all
but 5% of the time. [See results in Table 5.3 for non-boundary probab-
ilities corresponding to prior parameters 3.5, 6.4, 10/3, 2.5, 3.0, and
4.5.] The second reason is that the gamma and the Dirichlet marginals
performed well on the other tests (and the gamma generator had been
studied by Ahrens and Dieter). The third reason is that the sample
kurtosis for those random variables near a boundary was very high.
Therefore, from Snedecor and Cochran (1968, p. 89), the variance of the
sample variance was much larger than it would have been had the popula-
tion been normal. One calculation gave that the variance of the sample
variance over 300 trials, 200 observations per trial, for a gamma
generation of 0.1, was about 16.5 times as large as it would be in a
normal population. Hence, we could not expect the normal approximation
for the 95% confidence limits for the standard deviations to be good
in these cases.

Therefore, we accept reason (2) [that the normal approximation for
the 95% confidence intervals for the standard deviations was poor] for so
many standard deviations falling outside the 95% confidence intervals,
especially for boundary probabilities corresponding to prior parameters
0.1 and 9.8. We did not calculate the exact standard deviations (and
then test them). However, since the remaining tests (and Ahrens and
Dieter's work for the gamma) showed that the gamma and, more important,
the resulting Dirichlet variables were well generated, we considered
the Dirichlet random-number generator to be good.

5.7.3 Trinomial Rand^m^Number Generator; 

Given some value of p and some percentage PID of incomplete data,
we next generated the trinomial complete data $x = (x_1, x_2, x_3)$ and
incomplete data $z = (z_1, z_2, z_3, z_{12}, z_{13}, z_{23})$.

We first recalled that PID/100 is simply the probability that an
observation was incompletely classified. Second, given that an
observation was incomplete, the probability that it was unclassified
between $C_1$ and $C_2$ (i.e., the observation fell in $C_{12}$) was
$(p_1+p_2)/[(p_1+p_2)+(p_1+p_3)+(p_2+p_3)] = (p_1+p_2)/2$. Therefore, the
probability that an observation was incomplete and simultaneously fell
in $C_{12}$ was $(\mathrm{PID}/100)(p_1+p_2)/2$. Similarly, probabilities for
$C_{13}$ and $C_{23}$ were $(\mathrm{PID}/100)(p_1+p_3)/2$ and
$(\mathrm{PID}/100)(p_2+p_3)/2$, respectively, and the probability that an
observation was completely specified and fell in $C_1$, $C_2$, or $C_3$
was $(1-\mathrm{PID}/100)p_1$, $(1-\mathrm{PID}/100)p_2$, or
$(1-\mathrm{PID}/100)p_3$, respectively.

Therefore, to generate incomplete data z according to the likelihood
equation (2.8) for n observations, we could draw n uniform random numbers
$u_i$, $1 \le i \le n$. We would then use these six probabilities to
establish intervals that determined where an observation fell. For
example, if $0 \le u_i < (1-\mathrm{PID}/100)p_1$, we would increment $z_1$
by one and, if
$(1-\mathrm{PID}/100) \le u_i < (1-\mathrm{PID}/100) + [(p_1+p_2)/2] \times \mathrm{PID}/100$,
we would increment $z_{12}$ by one.

However, we also wanted to generate complete trinomial data x for
use in Section 5.9. Therefore, we had to divide $z_{12}$, $z_{13}$, and
$z_{23}$ into proportions that fell into completely specified categories
$C_1$, $C_2$, and $C_3$. To do so, we noted that if an observation fell in
$C_{12}$, then with probability $p_1/(p_1+p_2)$ it belonged in $C_1$;
similarly, for $C_{13}$ and $C_{23}$. Therefore, we divided the $z_{12}$,
$z_{13}$, and $z_{23}$ intervals, illustrated in the last paragraph, into
two by the ratios $p_1/(p_1+p_2)$, $p_1/(p_1+p_3)$, and $p_2/(p_2+p_3)$,
respectively.

Finally, we set to 0 each element of the complete data x and the
incomplete data z. We then created dummy variables $y_1$, $y_2$, $w_1$,
$w_3$, $v_2$, and $v_3$ and initialized them also to 0. From the
uniform-random-number generator described at the beginning of this
section, we drew n uniform random numbers $u_i$, $1 \le i \le n$. Then,
letting $h = \mathrm{PID}/100$ and $p_{ij} = p_i + p_j$,

    if                                                     add 1 to

    $0 \le u_i < p_1(1-h)$                                 $z_1$
    $p_1(1-h) \le u_i < p_{12}(1-h)$                       $z_2$
    $p_{12}(1-h) \le u_i < 1-h$                            $z_3$
    $1-h \le u_i < 1-h(1-p_1/2)$                           $y_1$
    $1-h(1-p_1/2) \le u_i < 1-h(1-p_{12}/2)$               $y_2$
    $1-h(1-p_{12}/2) \le u_i < 1-h(1-p_1-p_2/2)$           $w_1$
    $1-h(1-p_1-p_2/2) \le u_i < 1-(h/2)(1-p_1)$            $w_3$
    $1-(h/2)(1-p_1) \le u_i < 1-(h/2)(1-p_{12})$           $v_2$
    $1-(h/2)(1-p_{12}) \le u_i < 1$                        $v_3$

At the end of this process, we calculated the complete data x as

$$x_1 = z_1 + y_1 + w_1, \qquad x_2 = z_2 + y_2 + v_2, \qquad x_3 = z_3 + w_3 + v_3 \tag{5.7}$$

and the incomplete data z as

$$z_{12} = y_1 + y_2, \qquad z_{13} = w_1 + w_3, \qquad z_{23} = v_2 + v_3. \tag{5.8}$$


This trinomial random-number generator performed well on the same
kind of empirical tests used for the uniform, gamma, and Dirichlet
random-number generators. Note that to perform empirical tests on these
four generators, we used routines from the NASA, Langley Research Center,
and the IMSL (International Mathematical and Statistical Libraries, Inc.)
computer-program libraries.
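
The interval scheme described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original program: the function name `generate_trinomial`, the use of Python's `random` module in place of the uniform generator of this chapter, and the argument layout are all hypothetical.

```python
import random


def generate_trinomial(p, pid, n, rng=None):
    """Draw complete data x and incomplete data z for one simulated sample.

    p   : (p1, p2, p3), trinomial cell probabilities
    pid : percentage of incomplete data (0-100)
    n   : sample size
    rng : optional random.Random instance (an assumption of this sketch;
          the source used its own uniform generator)
    """
    rng = rng or random.Random()
    p1, p2, _ = p
    h = pid / 100.0
    p12 = p1 + p2

    # Upper end points of the nine consecutive intervals on [0, 1).
    edges = [
        p1 * (1 - h),               # z1
        p12 * (1 - h),              # z2
        1 - h,                      # z3
        1 - h * (1 - p1 / 2),       # y1: fell in C12, belongs to C1
        1 - h * (1 - p12 / 2),      # y2: fell in C12, belongs to C2
        1 - h * (1 - p1 - p2 / 2),  # w1: fell in C13, belongs to C1
        1 - h / 2 * (1 - p1),       # w3: fell in C13, belongs to C3
        1 - h / 2 * (1 - p12),      # v2: fell in C23, belongs to C2
        1.0,                        # v3: fell in C23, belongs to C3
    ]

    counts = [0] * 9
    for _ in range(n):
        u = rng.random()            # uniform on [0, 1)
        for k, edge in enumerate(edges):
            if u < edge:
                counts[k] += 1
                break

    z1, z2, z3, y1, y2, w1, w3, v2, v3 = counts
    x = (z1 + y1 + w1, z2 + y2 + v2, z3 + w3 + v3)   # complete data
    z = (z1, z2, z3, y1 + y2, w1 + w3, v2 + v3)      # incomplete data
    return x, z
```

Both returned vectors sum to n, and setting `pid` to 0 makes the three partially classified counts zero so that x and the first three elements of z coincide.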




5.8 Iteration Considerations : 

In this section we discuss the following considerations concerning 
the iterative algorithms: initial estimate, convergence criterion, 

problems, and conditions for convergence. 

Note here that we used the method that is noniterative in 5 for 
approximating elements of the exact posterior covariance matrix. 

5.8.1 Initial Estimate:

To use iterative algorithms for the maximum likelihood estimate, 
posterior mode, and Taylor-series approximated posterior mean, we needed 
initial estimates. Because a major concern of this work was approximating 
the exact posterior mean, we used the exact posterior mean for the ini- 
tial estimate. Thus, the number of iterations for convergence was 
another measure of which estimator best approximated the exact posterior 
mean. 

5.8.2 Convergence Criterion:

In general, the convergence criterion was

$$\mathrm{abs}[\dot{p}_i^{(\ell+1)} - \dot{p}_i^{(\ell)}] / \dot{p}_i^{(\ell)} \le 0.0001 \quad \text{for } i = 1, 2 \tag{5.9}$$

for $\dot{p}$ denoting one of APM, PMD, and MLE and $\ell$ denoting the
number of iterations.

This criterion gave stability in the $\dot{p}$ estimate to at least three
significant figures for all cases and to at least four significant
figures for nearly all cases. The expected p for Design 1 were ordered
so that the first two of the three components were less than 0.50.
Hence, for most cases the absolute difference between successive
iterations for the first two components of an estimate was less than
0.0001 x 0.50 = 0.00005 and thus the estimate was stable to the fourth
significant figure. The exceptions, which were accurate to the third
significant figure, involved those relatively rare cases resulting from
generated trinomial data yielding estimators having one of their first
two components greater than 0.50.

An artificial example of these exceptions would be trinomial data
generated from $p_2 = (.20, .30, .50)$ that yielded an estimator
$\dot{p} = (.10, .60, .30)$. The largest absolute difference (acceptable
for convergence) between $\dot{p}_2^{(\ell)}$ and $\dot{p}_2^{(\ell+1)}$, the
second component of $\dot{p}^{(\ell+1)}$, would be $.00006 \le .0001$; i.e.,
the fourth significant figure would be off by at most 1.

To avoid division by 0 (infinite result) and other small numbers
(possibly long iterations), whenever $\dot{p}_i^{(\ell)}$ was less than or
equal to 0.10, we used the convergence criterion

$$\mathrm{abs}[\dot{p}_i^{(\ell+1)} - \dot{p}_i^{(\ell)}] \le 0.00001 \quad \text{for } i = 1, 2. \tag{5.10}$$

This criterion was equivalent to the first one (5.9) for
$\dot{p}_i^{(\ell)} = 0.10$.
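
The two criteria (5.9) and (5.10) can be combined into one check. The following is a minimal sketch, assuming the estimate is held as a 3-tuple and that only the first two components are tested, as in the text; the function name `has_converged` and its keyword arguments are hypothetical.

```python
def has_converged(p_old, p_new, rel_tol=1e-4, small=0.10, abs_tol=1e-5):
    """Apply (5.9), or (5.10) when the previous component is <= `small`.

    p_old, p_new : successive iterates (p1, p2, p3); only the first two
                   components are checked, as in the text.
    """
    for old, new in zip(p_old[:2], p_new[:2]):
        if old <= small:
            # Absolute criterion (5.10) avoids dividing by values near 0.
            ok = abs(new - old) <= abs_tol
        else:
            # Relative criterion (5.9).
            ok = abs(new - old) / old <= rel_tol
        if not ok:
            return False
    return True
```

With the default thresholds the two branches agree at a component value of 0.10, since 0.0001 x 0.10 = 0.00001.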

5.8.3 Conditions for Convergence:

Recall from Sections 2.3 and 4.3.2 that the EM algorithm converges
to a solution of the likelihood equation if the eigenvalues of the
covariance matrix of the complete-data sufficient statistics are bounded
above zero. Hence, under these conditions the posterior mode and
maximum likelihood estimate converge to at least a local maximum. Since
the Taylor-series approximate posterior mean can be written as a
posterior mode (for the prior $\beta = \nu + 1$), it also converges to at
least a local maximum. The question, however, for the Taylor-series
approximate posterior mean is whether it converges to the exact
posterior mean. This question also applies to the maximum likelihood
estimate and to the posterior mode when they are used as approximations
of the exact posterior mean.

In Appendix 4E we addressed this question and determined conditions
under which an iterative solution to the Taylor-series approximate
posterior mean $\dot{p}$ agrees with the exact posterior mean $\tilde{p}$
within a small bounded error. We proved that if there exists a
neighborhood

$$\| p - \tilde{p} \|_\infty = \max_{1 \le i \le k} | p_i - \tilde{p}_i | < \rho, \quad \text{for } \rho > 0,$$

of the exact posterior mean such that for all values p in this
neighborhood

$$\max_i \sum_{j=1}^{k} | \partial g_i(p) / \partial p_j | \le \lambda < 1,$$

where

$$g_i(p) = \Big( z_i + \nu_i + \sum_{D \ni i} z_D \, p_i / p_D \Big) \Big/ \Big( n + \sum_{j=1}^{k+1} \nu_j \Big),$$

and an initial iterative estimate $\dot{p}^{(0)}$ is chosen within the
inner sphere $\| p - \tilde{p} \|_\infty \le \rho_0$, for this neighborhood,
then the Taylor-series approximation will converge to within
$\delta/(1-\lambda)$ of the exact posterior mean, where $\delta$ is a bound
on the error in approximating the exact posterior mean by a first-order
Taylor-series expansion. We also showed how to determine, in practice,
whether these conditions can be expected to hold. Note that the same
conditions apply to the posterior mode and maximum likelihood estimate
except that $g_i(p)$ is replaced by the appropriate function.
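
For the trinomial case, repeated application of the function g(p) above is a simple fixed-point iteration. The sketch below is illustrative only: the function names, the uniform starting value (the study started from the exact posterior mean), and the plain absolute stopping rule (the study used criteria (5.9) and (5.10)) are assumptions of this example.

```python
def apm_update(p, z, nu):
    """One application of g(p) for trinomial data.

    p  : current estimate (p1, p2, p3)
    z  : incomplete data (z1, z2, z3, z12, z13, z23)
    nu : prior parameters (nu1, nu2, nu3)
    """
    p1, p2, p3 = p
    z1, z2, z3, z12, z13, z23 = z
    denom = sum(z) + sum(nu)
    # Partially classified counts are shared out in proportion to p.
    s1 = z12 * p1 / (p1 + p2) + z13 * p1 / (p1 + p3)
    s2 = z12 * p2 / (p1 + p2) + z23 * p2 / (p2 + p3)
    s3 = z13 * p3 / (p1 + p3) + z23 * p3 / (p2 + p3)
    return ((z1 + nu[0] + s1) / denom,
            (z2 + nu[1] + s2) / denom,
            (z3 + nu[2] + s3) / denom)


def iterate_apm(z, nu, start=(1 / 3, 1 / 3, 1 / 3), tol=1e-10, max_iter=10000):
    """Iterate g until successive estimates differ by less than tol."""
    p = start
    for iteration in range(1, max_iter + 1):
        new = apm_update(p, z, nu)
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new, iteration
        p = new
    return p, max_iter


# Data of the first example in Figure 6.1, with the Design 1 prior
# (0.1, 0.1, 9.8); the result can be compared with the APM listed there.
p_hat, n_iter = iterate_apm((0, 0, 10, 1, 5, 9), (0.1, 0.1, 9.8))
```

Each update returns a vector that sums to one, because the partially classified counts are fully allocated among the three categories.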


5.8.4 Problems: 


Number of iterations - For some cases having components near zero,
convergence took a large number of iterations for the maximum likelihood 
estimate and the posterior mode. A few cases took over 200 iterations. 
As noted in Section 6.3, the largest number of iterations was 293 for 
the maximum likelihood estimate. 

Multiple solutions - As discussed in Chapters 3 and 4, equations
for the maximum likelihood estimate, posterior mode, and approximate
posterior mean are generally expected to have multiple roots. However,
as noted in Section 5.8.3, whenever the eigenvalues of the covariance
matrix of the complete-data sufficient statistics are bounded above
zero, an iterative solution for any of these three estimates converges
to a local maximum. Therefore, to ensure that the local maximum is a
global maximum, we should choose that root that maximizes the
likelihood. For the approximate posterior mean, we should choose that
root that maximizes the posterior density given the prior
$\beta = \nu + 1$; i.e., that root p for which the likelihood function

$$p_1^{z_1+\beta_1-1} \, p_2^{z_2+\beta_2-1} \cdots p_{k+1}^{z_{k+1}+\beta_{k+1}-1} \prod_D p_D^{z_D}$$

is a maximum. Although it has not been proved, from the complete-data
relationship between the posterior mode and posterior mean, we
intuitively expect the global maximum to be in the convergence region of
the exact posterior mean $\tilde{p}$, or at least be the closest root to
$\tilde{p}$.

As illustrated by examples in Section 4D.5 and discussed in Section
4.3.2, however, for trinomial data we usually expect only one root to
satisfy the constraints $0 \le p_i \le 1$, for all $1 \le i \le 3$, and
$\sum_{i=1}^{3} p_i = 1$ for p any one of the three estimators. Further,
exploratory calculations
showed that the iterative algorithm for the approximate posterior mean 
converged to the same solution for a wide range of initial estimates. 
Finally, all three iterative estimates were close enough to the exact 
posterior mean and the generator Dirichlet probability that we did not 
expect a different root as the global maximum. Thus, we did not seek 
more than one solution. 

5.9 Estimates of Mean Squared Error:

Recall that we defined the error $\dot{e}_i = \dot{p}_i - p_i$ for
$1 \le i \le 3$, $\dot{p}$ referring to one of estimators APM, PMD, and MLE
and p referring to the generator Dirichlet probability vector. We want
to estimate the mean squared error

$$\mathrm{mse}(\dot{p}) = E[\dot{e}_1^2 + \dot{e}_2^2 + \dot{e}_3^2] \tag{5.11}$$

of estimator $\dot{p}$.

For N denoting the number of simulation trials, the most common
estimate of the mean squared error (5.11) is

$$\widehat{\mathrm{mse}}(\dot{p}) = \sum_{j=1}^{N} \sum_{i=1}^{3} \dot{e}_{ij}^2 / N, \tag{5.12}$$

where $\dot{e}_{ij}$ is $\dot{e}_i$ on the jth simulation trial. We called
(5.12) the "regular" or "usual" mean-squared-error estimate.

For estimating mean squared errors of estimators for minimizing
expected quadratic loss, we used two Monte-Carlo techniques to reduce
the estimate's variance. In both, we took advantage of any covariance
of the quadratic-loss estimators APM, PMD, and MLE with the
complete-data maximum-likelihood estimate $\ddot{p}_i = x_i/n$, for $x_i$
denoting the number of the n (25 or 50) observations falling in
category i. We called the two resulting estimates the control-variate
mean-squared-error estimate and the regression mean-squared-error
estimate. Both are discussed by Kleijnen (1975, Part I, Chpt. III).

Let $\ddot{e}_{ij}$ denote $\ddot{e}_i = \ddot{p}_i - p_i$ on the jth
simulation trial and, paralleling (5.12), define
$\widehat{\mathrm{mse}}(\ddot{p}) = \sum_{j=1}^{N} \sum_{i=1}^{3} \ddot{e}_{ij}^2 / N$.
Then all three mean-squared-error estimates can be represented in the
form

$$\mathrm{mse}_{\mathrm{est}}(\dot{p}) = \widehat{\mathrm{mse}}(\dot{p}) + b \{ E[\widehat{\mathrm{mse}}(\ddot{p})] - \widehat{\mathrm{mse}}(\ddot{p}) \} \tag{5.13}$$

where

$$E[\widehat{\mathrm{mse}}(\ddot{p})] = E\Big( \sum_{i=1}^{3} \ddot{e}_i^2 \Big) = \sum_{i=1}^{3} [p_i(1-p_i)]/n = \Big[ 1 - \sum_{i=1}^{3} p_i^2 \Big] / n. \tag{5.14}$$

For the regular mean-squared-error estimate, b=0. For the
control-variate mean-squared-error estimate, b=1. For the regression
mean-squared-error estimate, b is the regression coefficient $b_{re}$ in
the linear regression of $\sum_{i=1}^{3} \dot{e}_i^2$ on
$\sum_{i=1}^{3} \ddot{e}_i^2$. Kleijnen (1975) discusses the general case
for a constant b not necessarily equal to 1.

Note that, in the terminology of Kleijnen (1975), the regression
mean-squared-error estimate is also a control-variate estimate. However,
the latter term is often used to denote our b=1 case and, to
differentiate between the b=1 and $b = b_{re}$ cases, we follow this
practice.

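If the regular estimate of the mean squared error and the regular
estimate of the complete-data maximum-likelihood-estimate mean squared
error are positively correlated such that

$$\mathrm{var}[\widehat{\mathrm{mse}}(\ddot{p})] < 2\,\mathrm{cov}[\widehat{\mathrm{mse}}(\dot{p}), \widehat{\mathrm{mse}}(\ddot{p})] < \mathrm{var}[\widehat{\mathrm{mse}}(\dot{p})] + \mathrm{var}[\widehat{\mathrm{mse}}(\ddot{p})], \tag{5.15}$$

then the control-variate estimate $\mathrm{mse}_{cv}(\dot{p})$ of the mean
squared error will have smaller variance than the regular estimate
because

$$\mathrm{var}[\mathrm{mse}_{cv}(\dot{p})] = \mathrm{var}[\widehat{\mathrm{mse}}(\dot{p})] + \mathrm{var}[\widehat{\mathrm{mse}}(\ddot{p})] - 2\,\mathrm{cov}[\widehat{\mathrm{mse}}(\dot{p}), \widehat{\mathrm{mse}}(\ddot{p})]. \tag{5.16}$$

Note that both the regular and the control-variate mean-squared-error
estimates are unbiased.
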
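The value of b that minimizes the variance of (5.13) is the
regression coefficient

$$b_{re} = \mathrm{cov}[\widehat{\mathrm{mse}}(\dot{p}), \widehat{\mathrm{mse}}(\ddot{p})] / \mathrm{var}[\widehat{\mathrm{mse}}(\ddot{p})] \tag{5.17}$$

used in the regression estimate $\mathrm{mse}_{re}(\dot{p})$. For $\rho$ the
correlation coefficient,

$$\mathrm{var}[\mathrm{mse}_{re}(\dot{p})] = \mathrm{var}[\widehat{\mathrm{mse}}(\dot{p})] \{ 1 - \rho^2[\widehat{\mathrm{mse}}(\dot{p}), \widehat{\mathrm{mse}}(\ddot{p})] \}. \tag{5.18}$$

Hence, the variance of the regression estimate is less than the variance
of the usual estimate (5.12) by a factor depending on the correlation
between $\sum_{i=1}^{3} \dot{e}_i^2$ and $\sum_{i=1}^{3} \ddot{e}_i^2$. We
estimate $b_{re}$ by the least squares estimate

$$b_{re} = \sum_{j=1}^{N} \Big\{ \Big[ \sum_{i=1}^{3} \dot{e}_{ij}^2 - \widehat{\mathrm{mse}}(\dot{p}) \Big] \Big[ \sum_{i=1}^{3} \ddot{e}_{ij}^2 - \widehat{\mathrm{mse}}(\ddot{p}) \Big] \Big\} \Big/ \sum_{j=1}^{N} \Big[ \sum_{i=1}^{3} \ddot{e}_{ij}^2 - \widehat{\mathrm{mse}}(\ddot{p}) \Big]^2. \tag{5.19}$$

Although the regression estimate of the mean squared error has
minimum variance, it is biased, because

$$E[\mathrm{mse}_{re}(\dot{p})] = E[\widehat{\mathrm{mse}}(\dot{p})] + E(b_{re}) \, E\Big( \sum_{i=1}^{3} \ddot{e}_i^2 \Big) - E[b_{re} \, \widehat{\mathrm{mse}}(\ddot{p})] \tag{5.20}$$

and, since $b_{re}$ is a function of $\widehat{\mathrm{mse}}(\ddot{p})$, the
last term in (5.20) does not equal the second term.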

As Cochran (1967) notes, the amount of bias in the regression
estimate is difficult to determine. Kleijnen (1975) reviews ways to
decrease or remove the bias. However, implementation of these methods
can be expensive. More important, 200 simulation trials were enough to
remove most of the bias. Results showed that in most cases the
regression estimate of the mean squared error lay between the unbiased
control-variate mse estimate and the unbiased regular mse estimate.
Hence, in all situations but one, we used the regression-estimate mse
since it had the smallest variance.

The one situation in which we did not use the regression estimate 
was in Design 2 for cases in which the denominator in (5.19) was zero. 
This sometimes happened when two components of the generated Dirichlet 
probability were zero to at least three decimal places. In these cases 
the complete-data maximum likelihood estimate was the same for all 200 
trinomial simulations. 
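
The three estimates of this section differ only in the value of b in (5.13), so they can be computed together. The sketch below is a hedged illustration: the function name `mse_estimates` and its inputs (per-trial sums of squared errors) are hypothetical, and returning `None` for the regression estimate when the denominator of (5.19) is zero is this example's way of flagging the Design 2 situation just described.

```python
def mse_estimates(sq_err, sq_err_cd, p, n):
    """Regular, control-variate, and regression estimates of mse.

    sq_err    : per-trial sums of squared errors for the estimator studied
    sq_err_cd : per-trial sums of squared errors for the complete-data MLE
    p         : generating probability vector (p1, p2, p3)
    n         : sample size of each simulated data set
    """
    trials = len(sq_err)
    mse_reg = sum(sq_err) / trials           # (5.12), b = 0
    mse_cd = sum(sq_err_cd) / trials         # same form for the control
    expected_cd = (1 - sum(pi * pi for pi in p)) / n   # (5.14)

    # Least squares slope (5.19).
    sxy = sum((a - mse_reg) * (c - mse_cd) for a, c in zip(sq_err, sq_err_cd))
    sxx = sum((c - mse_cd) ** 2 for c in sq_err_cd)

    mse_cv = mse_reg + (expected_cd - mse_cd)          # b = 1
    if sxx > 0:
        mse_re = mse_reg + (sxy / sxx) * (expected_cd - mse_cd)
    else:
        mse_re = None   # control variate is constant; slope undefined
    return mse_reg, mse_cv, mse_re
```

For example, `mse_estimates([0.02, 0.04, 0.06], [0.01, 0.02, 0.03], (1/3, 1/3, 1/3), 25)` gives a slope of 2, so the regression estimate moves twice as far from the regular estimate as the control-variate estimate does.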

5.10 Evaluation of Exact Posterior Mean and Covariance Matrices:

Recall from Section 2.2.4 the dimension, range, precision, and cost
problems that generally make numerical evaluation of the exact posterior
moments infeasible.

In our simulation work, however, we 

(1) had the smallest-dimension case, the trinomial, 

(2) designed the simulation study to have sample sizes small enough for 
the percentage of incomplete data, and 

(3) were able to use a computer with good enough range and significant- 
figure accuracy 

to allow numerical evaluation of these exact moments. 

For the trinomial case, the number of terms in each numerator and
denominator of the exact posterior moments is

$$\text{number of terms} = (z_{12}+1) \times (z_{13}+1) \times (z_{23}+1). \tag{5.21}$$

For sample sizes of 25 and 50, percentages of incomplete data of 15 and
40, and probabilities roughly ranging from (0,0,1) to (1/3,1/3,1/3), the
number of terms (5.21) ranged from a low near 1 to a high of
approximately 512.

For the CDC 6600 and Cyber 175 computers described in Section 5.4,
the magnitude range is $10^{-294}$ to $10^{322}$. This range is unusually
large for a computer, many of which have ranges more like $10^{-76}$ to
$10^{76}$. Therefore, with these special-purpose CDC scientific and
engineering computers, we could directly evaluate exact solutions for
SS/PID combinations as large as 402/50 or 335/60. The maximum SS/PID
combination that most other computers can handle is considerably
smaller. By "directly" we mean without much extra programming, execution
and storage cost, and additional rounding error for scaling down the
magnitude of the terms. As noted in Section 5.4, these CDC computers
have single-precision accuracy of about 14.5 significant figures. For
this machine accuracy, use of 11 significant figures for the gamma
$\Gamma(\cdot)$ functions, and the SS and PID used in this study, our
evaluations of the exact posterior moments were accurate to at least 6
significant figures.

Because they could be evaluated directly, equations for the exact
posterior moments were programmed in a straightforward manner. We used

$$\sum_{a=0}^{z_{12}} \binom{z_{12}}{a} \sum_{b=0}^{z_{13}} \binom{z_{13}}{b} \Big\{ \Gamma(z_1+\nu_1+a+b) \sum_{c=0}^{z_{23}} \binom{z_{23}}{c} \big[ \Gamma(z_2+\nu_2+z_{12}-a+c) \times \Gamma(z_3+\nu_3+z_{13}-b+z_{23}-c) \big] \Big\} \tag{5.22}$$

as a base for all moment calculations, increasing various of the inner
and outer sums to obtain numerators for the different desired moments.
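
A compact way to see how the triple sum produces the exact posterior mean is sketched below. It is not the original program: the name `exact_posterior_mean` is hypothetical, and the sketch works with logarithms of the gamma function (`math.lgamma`) rather than evaluating the gamma terms directly as described here, a substitution made so that the example does not depend on the floating-point range of the machine.

```python
from math import comb, exp, lgamma, log


def exact_posterior_mean(z, nu):
    """Exact posterior mean for trinomial data with a Dirichlet prior.

    z  : incomplete data (z1, z2, z3, z12, z13, z23)
    nu : prior parameters (nu1, nu2, nu3), all positive
    Returns the mean vector and the number of terms in the sum, (5.21).
    """
    z1, z2, z3, z12, z13, z23 = z
    total = sum(z) + sum(nu)

    terms = []  # (log weight, completed counts x) for every (a, b, c)
    for a in range(z12 + 1):          # a of the z12 belong to category 1
        for b in range(z13 + 1):      # b of the z13 belong to category 1
            for c in range(z23 + 1):  # c of the z23 belong to category 2
                x = (z1 + a + b,
                     z2 + (z12 - a) + c,
                     z3 + (z13 - b) + (z23 - c))
                log_w = (log(comb(z12, a)) + log(comb(z13, b))
                         + log(comb(z23, c))
                         + sum(lgamma(xi + vi) for xi, vi in zip(x, nu)))
                terms.append((log_w, x))

    shift = max(lw for lw, _ in terms)   # rescale before exponentiating
    denom = sum(exp(lw - shift) for lw, _ in terms)
    mean = tuple(
        sum(exp(lw - shift) * (x[i] + nu[i]) for lw, x in terms)
        / (denom * total)
        for i in range(3)
    )
    return mean, len(terms)
```

With no partially classified observations the sum has a single term and the result reduces to the usual complete-data posterior mean; for z = (0,0,10,1,5,9) the sum has (1+1)(5+1)(9+1) = 120 terms.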

For each set of data we called a function GAM once to evaluate
$\Gamma(z_1+\nu_1)$, $\Gamma(z_2+\nu_2+z_{12})$, and
$\Gamma(z_3+\nu_3+z_{13}+z_{23})$. GAM returned the gamma value from (1)
exact values, (2) Abramowitz and Stegun (1970) tables (accurate to 11
significant figures), or (3) Stirling's Formula for those cases in which
the formula gave an approximation accurate to 11 significant figures.
[Since Stirling's Formula is an asymptotic formula, there exists some
number of terms beyond which the accuracy decreases. For example,
$\Gamma(3)$ cannot be accurately approximated by Stirling's Formula to
more than six significant figures; the accuracy decreases beginning with
the seventh term.]

From then on, gamma terms in formula (5.22) and its variations were
evaluated by the relationship

$$\Gamma(y+1) = y \, \Gamma(y)$$

for both integer and non-integer values of y. Note that for
approximately half the cases, the argument to the gamma function was
non-integer. The coefficient in (5.22) was calculated as





where 


i 


was set to 1. 



CHAPTER 6 


RESULTS OF DESIGN 1 


6.1 Introduction : 

In this chapter, we present results from Design 1. In the second
section, we list special mnemonics common to these next two chapters.
In the third section, we discuss characteristics of the estimators
arising from the trinomial simulations. In the fourth section, we review
results from approximations for elements of the posterior mean and
covariance matrices. As part of this review, we discuss which of the
Taylor-series approximation, posterior mode, and maximum likelihood
estimate best approximates the posterior mean. Finally, we investigate
results on which estimator best minimizes quadratic loss. A summary
section concludes the chapter.

In addition to the mnemonics defined in Section 5.2, we will also
use the following in these next two chapters: 


APM      APMR0 (used in discussions concerning approximations for EPM,
         for which there was no robustness study)

APMR0    approximate posterior mean APM for robustness set 0 (original
         prior used in Bayesian estimators)

APMR1    approximate posterior mean APM for robustness set 1 (uniform
         prior used in Bayesian estimators)

APMR2    approximate posterior mean APM for robustness set 2 (perturbed
         prior used in Bayesian estimators)

EST      estimator

MLECD    maximum likelihood estimate for complete data (used as control
         variate in risk study)

NU       prior parameter $\nu$

OPID     observed percentage of incomplete data

P        p

PMD      PMDR0 (used in discussions concerning approximations for EPM,
         for which there was no robustness study)

PMDR0    posterior mode PMD for robustness set 0 (original prior used
         in Bayesian estimators)

PMDR2    posterior mode PMD for robustness set 2 (perturbed prior used
         in Bayesian estimators)





6.3 Estimators : 

In this section, we discuss a few properties of estimators from
the simulated trinomial data. Recall that for each combination of p,
SS, and PID we simulated 200 sets of complete and incomplete trinomial
data. From each set of incomplete data, we calculated the estimators
EPM, APMR0, PMDR0, MLE, APMR1 [recall that PMDR1=MLE], APMR2, and
PMDR2. The R0, R1, and R2 suffixes refer to robustness sets R0, R1,
and R2, respectively. From each set of complete trinomial data, we
calculated the complete-data maximum likelihood estimate MLECD.

To examine the sampling distribution of the estimators, we calcu- 
lated data summaries (extremes, hinges, and median), central values 
(mean, median, and trimean), and spreads (midspread and range) over the 
200 trinomial simulations. Prominent features were that the exact 
posterior mean and Taylor-series approximate posterior mean had almost 
identical distributions. So also did the complete-data and
incomplete-data maximum likelihood estimates. Since the priors were
nonzero, EPM and APM always had nonzero values. However, PMDR0, MLE,
and MLECD had a large number of zero values when p=(.01,.01,.98).

The number of iterations for convergence is given in Table 6.1. As 
expected, the number of iterations increased as the percentage of 
incomplete data PID increased. The largest change was for j^; for the 
original prior, the number of iterations approximately doubled. Direc- 
tion of sample-size effect was consistent only for APMR1. For this 
estimator, the average number of iterations decreased from 2% to 15% 
as SS increased. 

One factor affecting the average number of iterations for estimators
at was that 169 of the 9,600 (48x200) sets of six iterative
estimators for required more than 15 iterations. The maximum likeli- 
hood estimate constituted most of this 2%. The largest number of itera- 
tions was 293 for the maximum likelihood estimate. The large number of 
iterations occurred when one or more components of the simulated in- 
complete data z was zero. 

6.4 Approximating Posterior Moments:

6.4.1 Posterior Mean:

Our most important measure of the goodness of an approximation was
the percentage absolute relative difference. In Table 6.2 we give the
proportion of 200 trinomial simulations for which the percentage
absolute relative difference for each of the three components of an
approximation was less than specified amounts.

With a few exceptions at p^ and p^, for all cases the percentage 
absolute relative difference between the Taylor-series approximate pos- 
terior mean (APM) and the exact posterior mean (EPM) was less than 1%. 
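That is,

$$| \dot{p}_i - \tilde{p}_i | / \tilde{p}_i \times 100 < 1 \quad \text{for } 1 \le i \le 3,$$

so that

$$| \dot{p}_i - \tilde{p}_i | < 0.01 \times \tilde{p}_i$$

for all three components $\tilde{p}_i$, $1 \le i \le 3$, where $\dot{p}$
denotes APM and $\tilde{p}$ denotes EPM. Hence, the approximation was
accurate to at least two significant figures. The few exceptions are
studied later in this section.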
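
The measure just defined is straightforward to compute. The short sketch below uses a hypothetical function name, `pct_abs_rel_diff`, and illustrates it with the EPM and APM values reported for the first example of Figure 6.1.

```python
def pct_abs_rel_diff(approx, exact):
    """Percentage absolute relative difference, component by component."""
    return tuple(abs(a - e) / e * 100 for a, e in zip(approx, exact))


# EPM and APM for the first example of Figure 6.1.
epm = (0.0188, 0.0246, 0.9566)
apm = (0.0138, 0.0304, 0.9558)
diffs = pct_abs_rel_diff(apm, epm)
print(tuple(round(d, 1) for d in diffs))  # (26.6, 23.6, 0.1)
```

The exact value must be nonzero in every component, which holds for EPM because the priors were nonzero.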

Moreover, when PID=15, the APM approximation was accurate to at 
least three significant figures for nearly all cases and to at least 
four significant figures for the majority of cases. When PID=40, the 
approximation was accurate to at least three significant figures for 
most cases. 

As sample size increased from 25 to 50, the APM approximation 
generally improved for p£, p^, and p^. For p^ it slightly worsened. 
The reason is that there were a number of cases for p^ and where APM

was identical to EPM for SS=25. As the amount of sample data increased, 
the possibility of a perfect approximation lessened. As already indi- 
cated, as PID increased from 15 to 40, the APM approximation worsened, 
least for p^ and most for p^ in terms of three- and four-significant 
figure accuracy. 

In general, the posterior-mode (PMD) and maximum-likelihood-estimate 
(MLE) approximations were not accurate to even two significant figures. 
The main exception was at p^ when SS was 50. There the posterior mode 
agreed to two significant figures for approximately one-third of the 200 
trinomial simulations. 

Analyses later in this section showed that even in the few problem 
cases for p 1 and p 2 , APM was a much better EPM approximation than either 
PMD or MLE. Also, analyses found no bias, mean-squared-error, iteration, 
or other problems favoring PMD or MLE over APM. Finally, Table 6.2 
showed that, except possibly for the APM problem cases, APM was far 
superior to PMD and MLE in approximating the exact posterior mean in 
terms of percentage relative difference. Therefore, following a few 
comments in the next paragraph, we henceforth concentrate only on APM 
as an approximation for EPM. 

Because the exact posterior mean (EPM) was never zero and PMD and 
MLE were, PMD and MLE were poorest approximations for p^=(.01,.01,.98). 
The better of PMD and MLE was MLE for p^ and p^ and PMD for Pg and p^. 
However, note that even for p^, when PMD improves in its approximation, 
it is far inferior to APM. Plots given later in this section 
illustrate these comparisons. 

In Table 6.3 we present the bias for the first component of the
three approximations to the exact posterior mean. For PID=15 and PID=40,
the bias was estimated over the 200 trinomial simulations by

$$\sum_{j=1}^{200} (\dot{p}_{1j} - \tilde{p}_{1j}) / 200 = E(\dot{p}_1) - E(\tilde{p}_1)$$

for $\dot{p}_1$ the first component of one of the three approximations
APM, PMD, and MLE. The complete-data (PID=0) bias is given for all
estimators except for the posterior mode at $p_1 = (.01, .01, .98)$.
Recall that the prior used in the Bayesian estimators for $p_1$ in
Design 1 was $\nu_1 = (0.1, 0.1, 9.8)$. For the pair $(p_i, \nu_i)$ having
such small values for i=1,2, a solution to the likelihood equations
usually does not exist in $P_2$. [Note that
$E[(x_i + \nu_i - 1)/(n + \sum_j \nu_j - 3)] < 0$ for $\nu_i = 0.1$,
$p_i = .01$, and n=25 or 50, since $E(x_i) = n p_i$.] In this case, the
posterior mode occurs on a boundary. Hence, $\dot{p}_i = 0$ for i=1,2.
Thus, the likelihood equations are not used to define the posterior
mode; therefore, the bias cannot be analytically calculated from the
ith solution (2.43) to the likelihood equations.
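
The sign claim in the bracketed note can be checked by direct substitution. The following worked step is only a sketch: it assumes that the prior parameters sum to 0.1 + 0.1 + 9.8 = 10, as the prior quoted above implies, and it replaces the count for component i by its expectation, n times .01.

```latex
\begin{align*}
n = 25: &\quad \frac{n p_i + \nu_i - 1}{n + \sum_j \nu_j - 3}
          = \frac{0.25 + 0.1 - 1}{25 + 10 - 3}
          = \frac{-0.65}{32} \approx -0.020, \\
n = 50: &\quad \frac{n p_i + \nu_i - 1}{n + \sum_j \nu_j - 3}
          = \frac{0.50 + 0.1 - 1}{50 + 10 - 3}
          = \frac{-0.40}{57} \approx -0.007.
\end{align*}
```

Both values are negative, which is why the unconstrained solution falls outside the simplex and the mode is taken on the boundary.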

Although the bias was small for all approximations, it was one to 
three orders of magnitude smaller for APM. For APM, the bias was smallest 
in absolute value for p^. For PID=15, it was largest in absolute value 
for or p^; for PID=40, it was largest in absolute value for p^ or p^. 

As sample size increased, the bias generally decreased. The bias was 
positive for p^ and p^ and of both signs for p^ and p^. Note that, for 
p^, p^, and p^, results for the second component of the bias were the 
same as those for the first component of the bias. Results for the 
second component, .30, of p^ were similar to those for the first and 
second components, both .33, of p^. 

Although the estimated biases were small, the individual errors
constituting the estimated biases could be large. To investigate this 
possibility, we calculated data summaries, central values, and spreads 
over 200 trinomial simulations for the errors of APM, PMD, and MLE in 
approximating EPM. In general , as sample size increased, error decreased; 
as PID increased error increased; and as p moved from the corner p^= 

(.01, .01, .98) to the center p^=(l/3,l/3,l/3) of the P 2 simplex, the error 
decreased. 

The central values, especially the mean, often differed because the 
distribution of the errors was not symmetric. To examine this asymmetry, 
we studied the proportion of the 200 simulations in which the first com- 
ponent of the error was of a given sign. Results showed, for SS=25, 
that for p 2 , p 3 , and p^ approximately one half of the APM errors were 
negative. The remaining half were zero or positive. For p^, however, 
almost three fourths of the errors were negative. As sample size increased
to 50, the errors remained roughly split as half negative and
half positive for p^ and p^. For p 2> however, the error was approximate- 
ly two-thirds negative and one-third positive. For p^, it was close 
to 92% negative, 4% positive, and 4% zero. As expected, the distribution 
of the APM error was much tighter than those for the PMD and MLE errors. 

Finally, the smallness of the midspread relative to the range for
all but the (SS=50, PID=40) case (as well as values of the hinges
relative to those of the extremes) indicated that most of the APM errors
clustered close to zero and that the extreme values were few and unusual.

We next studied these extreme values. In particular, we investigated
those cases in Table 6.2 that showed a percentage absolute relative
difference greater than 15. All these cases occurred at
$p_1 = (.01, .01, .98)$.

First, all these cases had empty (0) cells for $z_1$ and $z_2$. Second,
all these cases occurred for PID=40. However, the observed PID is not
necessarily 40. We called this observed percentage of incomplete data
OPID. Those cases having high percentage relative difference usually
had very high OPID (often in the 50% range). Finally, for these cases,
the incomplete data was "inconsistent" with the completely specified
data and, perhaps less important, with the sampling model. That is,
under the sampling model with $z_1 = z_2 = 0$ and $z_3$ large, we would
expect $z_{12}$ small and $z_{13} \approx z_{23}$. Examples are shown in
Figure 6.1, where the estimators are given in successive order as the
exact posterior mean, Taylor-series approximate posterior mean, maximum
likelihood estimate, and posterior mode and where, again,
$z = (z_1, z_2, z_3, z_{12}, z_{13}, z_{23})$.

In all three examples, the percentage of incomplete data is very
high, 60%, 56%, and 50%, respectively. Further, the data are
inconsistent. To see the inconsistency, compare the generated data z
with the expected value of the data given the sampling model. Recall,
from Chapter 5, especially Section 5.7.3, that the sampling model is a
function of p, PID, and SS. Expected values of z are given in each
example. The most noticeable discrepancy between the expected and
generated data is in the relationship between $z_{13}$ and $z_{23}$. The
expected values are identical. The observed values, however, differ
greatly. In example 1, $z_{13}$ is approximately one-half $z_{23}$; in
example 2, $z_{13}$ is more than three times $z_{23}$; and in example 3,
$z_{13}$ is almost twice $z_{23}$. Thus, the probability of observing any
data set in these examples, given the sampling model, is small.

FIGURE 6.1

WORST APM APPROXIMATIONS FOR EXACT POSTERIOR MEAN

1. OPID=60*, SS=25, z=(0,0,10,1,5,9), E(z)=(.15,.15,14.7,.1,4.95,4.95);

   Estimators                     Error                       % abs rel diff

   EPM (.0188, .0246, .9566)
   APM (.0138, .0304, .9558)      (-.0050,  .0058, -.0008)    ( 26.6,  23.6, 0.1)
   MLE (.0001, .0625, .9374)      (-.0187,  .0379, -.0192)    ( 99.5, 154.1, 2.0)
   PMD (0, 0, 1)                  (-.0188, -.0246,  .0434)    (100.0, 100.0, 4.5)

2. OPID=56*, SS=25, z=(0,0,11,1,10,3), E(z)=(.15,.15,14.7,.1,4.95,4.95);

   Estimators                     Error                       % abs rel diff

   EPM (.0266, .0168, .9566)
   APM (.0351, .0102, .9547)      ( .0085, -.0066, -.0019)    ( 32.0,  39.3, 0.2)
   MLE (.0666, 0, .9334)          ( .0400, -.0168, -.0232)    (150.4, 100.0, 2.4)
   PMD (0, 0, 1)                  (-.0266, -.0168,  .0434)    (100.0, 100.0, 4.5)

3. OPID=50*, SS=50, z=(0,0,25,2,15,8), E(z)=(.3,.3,29.4,.2,9.9,9.9);

   Estimators                     Error                       % abs rel diff

   EPM (.0275, .0186, .9539)
   APM (.0369, .0105, .9526)      ( .0094, -.0081, -.0013)    ( 34.2,  43.6, 0.1)
   MLE (.0571, .0001, .9428)      ( .0296, -.0185, -.0111)    (107.6,  99.5, 1.2)
   PMD (.0262, 0, .9738)          (-.0013, -.0186,  .0199)    (  4.7, 100.0, 2.1)

* PID=40 for all three examples
In practice, one does not know the population model and thus cannot
check for consistency in the same manner. However, if one calculates the
expected value of the data using the estimator p and OPID (the observed
percentage of incomplete data), rather than p and PID, one finds the same
discrepancy between z_13 and z_23 (even though p is a function of z).

These expected values for the three examples are:

1.  E(z|p,OPID) = (.28,.37,14.35,.22,4.88,4.91)
    E(z|p,OPID) = (.21,.46,14.34,.22,4.85,4.93)
    E(z|p,OPID) = (1.00,0.0,14.00,.33,5.00,4.67)
    E(z|p,OPID) = (0.00,0.0,15.00,0.0,5.00,5.00)

2.  E(z|p,OPID) = (.29,.18,10.52,.30,6.88,6.81)
    E(z|p,OPID) = (.17,.17,10.78,.31,6.93,6.75)
    E(z|p,OPID) = (.73,0.0,10.27,.47,7.00,6.53)
    E(z|p,OPID) = (0.0,0.0,11.00,0.0,7.00,7.00)

3.  E(z|p,OPID) = (0.69,.47,23.85,.58,12.27,12.16)
    E(z|p,OPID) = (0.92,.26,23.82,.59,12.37,12.04)
    E(z|p,OPID) = (1.43,.00,23.57,.72,12.50,11.79)
    E(z|p,OPID) = (0.66,0.0,24.35,.33,12.50,12.17),


respectively. Therefore, to indicate whether data are inconsistent, an 
approach that can be used in practice is to compare the data with the 
expected value of the data given OPID and any of these four estimators. 
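The consistency check described above can be sketched in a few lines of Python. This is only an illustration under an assumption inferred from the tabulated E(z) values, not a quotation of the sampling model of Section 5.7.3: a fraction PID/100 of the SS observations is incomplete, and each incomplete observation is reported with equal chance in either of the two pairs that contain its true cell. The function name `expected_data` is hypothetical.

```python
def expected_data(p, ss, pid):
    """Expected counts (z1, z2, z3, z12, z13, z23).

    Assumed model (inferred from the tabulated E(z), not taken from
    Section 5.7.3): ss*pid/100 observations are incomplete, and each
    one is reported in either pair containing its true cell with
    probability 1/2.
    """
    p1, p2, p3 = p
    n_inc = ss * pid / 100.0      # expected number of incomplete observations
    n_com = ss - n_inc            # expected number of complete observations
    complete = [n_com * p1, n_com * p2, n_com * p3]
    partial = [n_inc * (p1 + p2) / 2,
               n_inc * (p1 + p3) / 2,
               n_inc * (p2 + p3) / 2]
    return complete + partial


# Generator p = (.01, .01, .98) with SS=25 and PID=40.
ez = expected_data((0.01, 0.01, 0.98), ss=25, pid=40)
print([round(v, 2) for v in ez])

# Observed data of example 1: the expected z13 and z23 are equal,
# while the observed values are 5 and 9.
z = (0, 0, 10, 1, 5, 9)
print(ez[4] - ez[5], z[4] - z[5])
```

Under this assumed model the first call gives the same expected counts as those listed for examples 1 and 2, which is why equal expected values of z_13 and z_23 paired with very unequal observed values signal inconsistent data.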

For the Taylor-series approximate posterior mean (APM), the second 
and third examples had the highest percentage absolute relative difference 
of all cases. The second example is the one case keeping the proportion 
from being 1.00 in column 7 of Table 6.2 for "% abs rel diff < 25".

Note that, as also found in the remaining cases, the posterior mode and 
maximum likelihood estimate were even worse approximations than was the 
Taylor-series approximate posterior mean. 

As an extra check that the Taylor-series approximate posterior mean 
was the best approximation for the exact posterior mean, even in the 
rare cases just illustrated when the percentage relative difference was 
high, we calculated the proportion of 200 trinomial simulations when 
an estimator was best. Because it is possible, especially with three
estimator components p_i, $1 \le i \le 3$, for an approximation to be
minimum with respect to one criterion but not with respect to another,
we used two different criteria to determine when an estimator was best.
For a squared-error criterion, for each of the $1 \le j \le 200$
trinomial simulations, we chose the approximation that had the smallest
squared error $\sum_{i=1}^{3} (\hat{p}_{ij} - \tilde{p}_{ij})^2$, where
$\hat{p}_{ij}$ denotes component i of the approximation and
$\tilde{p}_{ij}$ component i of the exact posterior mean in simulation j.
For a relative-difference criterion, for each of these 200 simulations,
we chose the approximation having the smallest absolute relative
difference $\sum_{i=1}^{3} |\hat{p}_{ij} - \tilde{p}_{ij}| / \tilde{p}_{ij}$.
Note that the divisor in the latter criterion was never zero. By both
criteria, for all sets of p, PID, and SS variations, and for both
replications, APM was always a better approximation for EPM than were
PMD and MLE.
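The two criteria can be sketched as follows. This is an illustration, not the program used in the study; the function names are hypothetical, and the numbers are the estimator values of example 1 in Figure 6.1, with the labels MLE and PMD for the last two rows taken as an assumption.

```python
def squared_error(approx, exact):
    """Sum over the three components of squared differences."""
    return sum((a - e) ** 2 for a, e in zip(approx, exact))


def abs_rel_diff(approx, exact):
    """Sum over the three components of |approx - exact| / exact."""
    return sum(abs(a - e) / e for a, e in zip(approx, exact))


def best_approximation(exact, candidates, criterion):
    """Name of the candidate with the smallest value of the criterion."""
    return min(candidates, key=lambda name: criterion(candidates[name], exact))


# Example 1 of Figure 6.1 (row labels MLE and PMD are assumed).
exact = (0.0188, 0.0246, 0.9566)          # exact posterior mean
candidates = {
    "APM": (0.0138, 0.0304, 0.9558),
    "MLE": (0.0001, 0.0625, 0.9374),
    "PMD": (0.0, 0.0, 1.0),
}

print(best_approximation(exact, candidates, squared_error))
print(best_approximation(exact, candidates, abs_rel_diff))
```

Repeating the selection over all 200 simulations and counting how often each name is returned gives the proportions discussed in the text.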

Relating to the squared-error criterion, we next investigated in 
Table 6.4 the mean squared errors of the approximations. Since APM 
always had the smallest squared error for each of the 200 trinomial 
simulations, it also had to have the smallest mean squared error [often 
called the average mean squared error]. However, we were also 
interested in order-of-magnitude comparisons among estimators and how 
mean squared error varied with p, SS, and PID. Mean squared error (mse)
is often used for comparison among estimators because it measures estima- 
tor variance as well as bias. 

As for the bias, we also calculated the mean squared errors for
the complete-data (PID=0) estimators. Note that, as discussed for
Table 6.3, we could not analytically calculate the mean squared error
for the posterior mode at p_1. For PID=15 and PID=40, mean squared error
was estimated by the "usual" estimate
$\sum_{j=1}^{200} (\hat{p}_j - \tilde{p}_j)^2 / 200$. We did not use
any variance-reduction techniques, such as discussed in Section 5.9, in
estimating these mean squared errors because the control variate p for
the risk study was not expected to be helpful for the exact-posterior-
mean study. Further, the mean squared error was not as important in
the exact-posterior-mean study as it was in the risk study. Hence, the
greater care in its estimation was not necessary. Finally, the differ-
ence between the regular APM mean-squared-error estimate and either of
the regular PMD or MLE mean-squared-error estimates was so large that
use of a variance-reduction technique was not expected to alter results
concerning differences.

Results of Table 6.4 show that the APM mean-squared-error estimate 
was \\ to 6 orders of magnitude smaller than those for PMD and MLE. 

Mean squared error increased 1 to 2 orders of magnitude as PID increased 
from 15 to 40. It usually decreased as SS doubled. For easier 
comparison of APM with PMD and MLE, average bias and mean-squared-error 
ratios are given in Table 6.5. Note from Table 6.2 that the bias ratios 
are only for the first component of an estimator. 
Finally, recall Table 6.1 showing the number of iterations required
for convergence of the iterative approximations. Since the initial
iterative estimate was the exact posterior mean, the number of
iterations was some measure of which approximation was best. By this
measure also, the Taylor-series approximate posterior mean was the
superior approximation.

We next performed an analysis of variance (ANOVA) on the bias and
on the mean-squared-error data. From theory in Steel and Torrie (1960,
p. 157) and Snedecor and Cochran (1968, pp. 324-5, 329) and from
examples of Dempster, Schatzoff, and Wermuth (1977, p. 77) and Gunst and
Mason (1977, p. 616), we expected errors from an ANOVA on the original
mean-squared-error data to exhibit enough nonnormality and inequality of
variances to yield too many falsely significant F tests. Therefore, for
protection against this occurrence, along with improved additivity of
the model, we transformed the mean squared errors to natural logarithms.
Doing so, however, meant that all mean-squared-error results are
interpreted in terms of the log(mse) rather than more naturally in terms
of the original data. For the risk study, however, we do give an
approximate translation of results from logarithms back to the original
data.

Note that, although an ANOVA is concerned with all factors affect- 
ing bias and log(mse), we are interested only in those significant 
effects involving the estimators. Note also that usually one studies 
residuals from the ANOVA model to detect failure to meet assumptions 
and to learn whether any transformation might correct the failure. 
However, Scheffé (1967, p. 363) generally recommends against transforming
data to reduce nonnormality in analyzing means. He does so because
interpretation of results concerning transformed data is often difficult.
We have already transformed to logs. Thus, further transformation, even
if warranted, would lose more in ease of interpretation than would be
gained in improving assumptions, especially since the F-test is already
fairly robust against departures from assumptions. Therefore, we do not
analyze the residuals.

Results from the bias ANOVA, along with significant values, are 
given in Table 6.6A. The presence of high-order significant interactions 
affects conclusions about lower-order interactions and the main effects. 
For the EPM bias ANOVA in Table 6.6A, the main effects for P (p) and 
estimator EST and the two-factor interaction PxEST are so highly signi- 
ficant relative to the remaining effects that, together with previous 
bias results, we expect the remaining significant two-factor and three- 
factor interactions to mean only that effects of EST, P, and PxEST vary 
with SS and PID. 

Plots in 6.6B confirm this hypothesis. As sample size SS increases 
or PID decreases, the average bias (summed over those factors not 
appearing in the plot) slightly decreases. Approximation APM has zero 
average bias. So also, approximately, does MLE. The most striking 
effect of these two plots is the poorness of the posterior mode as an
approximation for the exact posterior mean for all but p^, and
especially for p_2, in terms of average bias.

In Table 6.7A we present F values in the ANOVA for natural 
logarithms of the estimated mean squared errors given in Table 6.4. 

Since estimator EST has such huge significance relative to other fac- 
tors, it will be at least partly responsible for the significant 
higher-order interactions. The significant estimator two-factor
interactions are plotted in Table 6.7B. The larger the negative value
of log_e(mse), the better the approximation. Thus, plots in Table 6.7B
show that mean squared error decreases slightly as SS increases or PID
decreases and that APM is the superior approximation. Approximation
APM is poorer at p^ than at the remaining values of p.

The three-factor interaction PxPIDxEST was significant at the 10%
level. The effect of PID on the PxEST plot given in Table 6.7B was that,
as PID increased from 15 to 40, differences between APM and either of
MLE and PMD decreased and all approximations slightly worsened.

6.4.2 Posterior Covariance Matrix:

In this subsection, we discuss results from Design 1 concerning how 
well the truncated Taylor-series expansion approximated elements of the 
posterior covariance matrix. 

Note first that elements of the Taylor-series approximate posterior 
covariance matrix were calculated by the method that is noniterative in 
elements of the posterior covariance matrix. This method was described 
in Section 3.2.8. After convergence of components of the approximate 
posterior mean vector, we solved a linear system of equations for the 
approximate variances and covariances. These approximations are functions 
of the approximate posterior means. Thus, the accuracy of the posterior 
variance and covariance approximations is a function of the accuracy of 
the posterior mean approximations. 

Data summaries, central values, and spreads over 200 trinomial 
simulations were calculated for the covariance approximations for the 
first replication. In general, results indicated very good agreement
between sampling distributions of the Taylor-series approximations and
the exact posterior covariances. Agreement improved as p moved from
the corner p_1 to the center p_4 of the P 2 simplex. Central values agreed
well for all values of p except p_1, where, as noted in the last section,
the distribution of values was heavily skewed because we were at a lower
bound for the first two components.

As for the posterior mean, the most important measure of the accu- 
racy of an approximation for an element of the posterior covariance ma- 
trix was the percentage of absolute relative difference. In Table 6.8, 
we give the proportion of 200 trinomial simulations in which the percent- 
age absolute relative difference of the Taylor-series approximation is 
less than specified amounts. The column headings C11, C12, and C22
denote var(p_1|z), cov(p_1,p_2|z), and var(p_2|z), respectively.
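The accuracy measure behind Table 6.8 can be sketched as below. This is a minimal illustration: the thresholds and the four data values are hypothetical, since the "specified amounts" of the table and the simulated values are not reproduced here, and the function names are not from the original study.

```python
def pct_abs_rel_diff(approx, exact):
    """Percentage absolute relative difference of one approximation."""
    return 100.0 * abs(approx - exact) / abs(exact)


def proportion_below(approxs, exacts, thresholds):
    """Proportion of simulations whose % abs rel diff is below each threshold."""
    diffs = [pct_abs_rel_diff(a, e) for a, e in zip(approxs, exacts)]
    return {t: sum(d < t for d in diffs) / len(diffs) for t in thresholds}


# Hypothetical exact and approximate posterior variances for four simulations.
exacts = [0.0040, 0.0052, 0.0031, 0.0047]
approxs = [0.0040, 0.0053, 0.0030, 0.0056]

# Hypothetical thresholds, in percent.
print(proportion_below(approxs, exacts, thresholds=(0.5, 5.0, 15.0)))
```

Dividing by the absolute value of the exact quantity keeps the measure usable for the covariances, which are negative.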

Results show that the variance approximation was correct to at 
least two significant figures for nearly all 200 trinomial simulations 
when PID=15. When PID=40, the proportion of 200 variance approximations 
accurate to at least two significant figures ranged from .83 to 1.00. 
Further, for the majority of cases, the variance approximation was
accurate to at least three significant figures. 

Excluding p^, we find that the approximation remained excellent or 
improved as p moved toward the center of the P 2 simplex. Except for p^, 
the approximation worsened as PID increased. Sample size SS had little 
effect when PID=15 because the approximation was already excellent when
SS=25. When PID=40, the approximation remained excellent for p^,
slightly improved for p^ and p_2, and slightly worsened for p^ as SS
increased.

We next investigated the accuracy of the Taylor-series approximation 
for the posterior covariance. Results in columns headed by "C12" show 
that it was not as good an approximation as that for the variance. Even 
so, for nearly all trinomial simulations, the covariance approximation 
was correct to at least two significant figures. As for the variance 
approximation, the covariance approximation remained excellent or improved 
for p^ and p^ and became poorer for p^ as the sample size increased. As 
the percentage of incomplete data increased, the approximation worsened. 

To examine relatively poorer results for p^ and p^, we investigated 
averages (over 200 trinomial simulations), percentage average relative 
difference, average percentage relative difference, and ratio of square 
root of the estimated mean squared error to the average exact value. 

For p^ and p^, the covariance averages were approximately an order
of magnitude smaller than the variance averages. In particular, for p_1,
the exact posterior covariances ranged in value from -0.4x10^-5 to
-0.2x10^-4. It could be that values so close to zero were more difficult

to approximate. To support this hypothesis, we noticed that when covar- 
iances roughly equaled variances, then the average percentage relative 
differences were also roughly equal. For example, average percentage 
relative differences for the approximate posterior covariance of p^ and 
p^ given z at p^ and the approximate posterior variance of p^ given z at 
Pp both at PID=15, were of the same magnitude and their average percen- 
tage relative differences were also of the same magnitude. 

For all but one case, the square-root ratio was less than 1. 

Finally, the standard errors of the average variance and covariance 
approximations were large relative to the averages. Therefore,
statistically, we could not differentiate between the approximations 
and the corresponding exact values. 

We also investigated biases of the Taylor-series approximations by 
examining data summaries, central values, and spreads over 200 trinomial 
simulations. These results showed that the sampling distributions were 
tight. Not only were the means, medians, trimeans, and midspreads zero,
but also the ranges were zero to at least three, and usually four, 
decimal places. 

In general, as sample size increased, the bias decreased. As per- 
centage of incomplete data increased, bias increased. As p moved toward 
the center of the P 2 simplex, bias decreased. Exceptions again occurred
at p_1 and p_2 because of the larger number of perfect approximations at
those values of p. 

To determine whether the Taylor-series approximations generally 
overapproximated, we next investigated the proportion of biases having
a positive sign. A positive bias is preferable for a variance
approximation because we have defined bias as "approximation - exact".
Thus, if most of the biases are positive, then the approximation generally
provides an upper bound on the exact posterior variance. 

Results showed that the proportion of positive biases was, as it has
been for other measures, a function of the position of p in the P 2
simplex. When p was near the center of the simplex, most of the biases 
were positive. As p moved toward a corner of the simplex, the propor- 
tion of negative biases increased. At a corner, negative biases 
dominated.

Although the large proportion of negative variance biases at p_1 is
not preferred, it is not of concern as long as the percentage relative 
difference of the approximation is small. At the end of this section, 
we investigate cases where the percentage relative difference is greater 
than 15. 

For the covariance approximation, a negative bias was preferred.
Since the covariances were negative, a negative bias meant that the
approximate covariance was larger in absolute value than the exact
covariance. The proportion of negative biases was roughly governed by
the sum of the two generator p components involved in the covariance.
When this sum was less than 0.75, the proportion of negative biases was
larger than that of positive biases. When the sum was between 0.75 and
1.00, the opposite occurred. For example, the proportion of negative
biases for cov(p_1,p_2|z) for p_1=(.01,.01,.98) was near 1; that for
cov(p^,p^|z) was near 0. As SS or
PID increased, the proportion of negative biases generally increased. 

We now investigate those variance and covariance approximations 
differing in percentage absolute relative value from the exact values 
by more than 15%. In approximately one-third of these cases, the 
Taylor-series approximation and the exact posterior mean also differed 
in percentage absolute relative value by more than 15%. We expected 
this correlation since elements of the posterior covariance matrix were 
functions of the approximate posterior means. In these situations, 
approximations for the posterior variances were usually equal to or 1 to
10 percentage points better than the posterior mean approximation;
approximations for the posterior covariance, usually equal to or 1 to 10
percentage points worse.
Part of the reason the covariance approximation was worse than the 

variance approximation seems, again, to be that the closer the exact 

value was to zero, the harder it was to estimate. 

Recall from Section 6.4.1 that, in these cases of poorer approximations
for the posterior mean as well as for the posterior covariance
matrix, the incomplete data z had zero observations for z_1 and z_2, the
true percentage of incomplete data TPID was usually very high, and the
incompletely specified observations z_12, z_13, and z_23 were inconsistent
with the completely specified observations z_1, z_2, and z_3 and with the
sampling model.


Of the remaining two-thirds of the cases, three-fifths also had zero
observations for z_1 and z_2 and had inconsistent data. Most also had a
high percentage of incomplete data. Of the last two-fifths of the cases,
all but two had percentage absolute relative difference less than 24.
These percentages were

(32,20,15), 20, 19, 21, 21, 22, 17, 23, 19, (41,24,18), 15, 17, 19, 16, 23, 22, 16.

Two values of 15 are present because they were greater than 15.000.
Numbers in parentheses apply to the same set of data. All percentages,
except the 20,15 and 24,18 in parentheses, are for cov(p_1,p_2|z),
the covariance of two very small values, each varying around 0.01. The
20,15 and 24,18 were values for var(p_2|z), cov(p_2,p_3|z). Nearly all
of these cases occurred for data sets having one of z_1 and z_2 equal to
0 and the remaining value equal to 1.
The values 32 and 41 enclosed in parentheses were of concern. The
data for these values were z=(1,0,12,1,3,8) and z=(1,0,26,1,6,16),
respectively. Observed percentages of incomplete data (OPID) were high,
48% and 46%, respectively. Further, the data were inconsistent for these
two cases for a sampling model yielding E(z)=(0,0,15,0,5,5) and
E(z)=(0,0,30,0,10,10), respectively. As for the three "problem" examples
given in the last section, under this sampling model, with
E(z_1)=E(z_2)=0 and E(z_3) large, we would expect z_13 to be approximately
equal to z_23. Yet, in both cases z_23 was approximately three times as
large as z_13.

In essence, when the posterior means of p_1 and p_2, respectively,
were very small, we expected the posterior covariances of p_1 and p_2 to
be very small. Trying to approximate very small covariances, or
covariances of very small values, was relatively difficult, especially
when at least one of the two corresponding completely specified
observations z_1 and z_2 was zero.


6.4.3 Conclusions: 


The Taylor-series approximation for the exact posterior mean was 
excellent. In most cases it was accurate to at least three significant 
figures; in many cases, to at least four. In the few exceptions, where 
the percentage absolute relative difference ranged between 15% and 40%, 
the data had zero values for two of the three completely specified cells, 
the percentage of incomplete data was usually very high (40% - 60%), 
and the incompletely specified data was inconsistent with the completely 
specified data and with the sampling model. Even in these cases, 
however, the Taylor-series approximate posterior mean was a better 
approximation than the posterior mode or maximum likelihood estimate.

The posterior mode and maximum likelihood estimates were nearly always 
very poor approximations for the exact posterior mean. 

The posterior variance and covariance Taylor-series approximations 
were functions of the Taylor-series approximate posterior mean. There- 
fore, they were not quite as excellent as approximations in terms of 
percentage relative difference; the error of the Taylor-series 
approximate posterior mean was built into their errors. Nonetheless, 
they were very good. In nearly all cases, they were accurate to at 
least two significant figures; in most cases, to at least three. As for 
the posterior mean, exceptions occurred for inconsistent incomplete 
data having zero values for any two of the three completely specified 
cells, especially when the percentage of incomplete data was high. 
Exceptions also occurred for the posterior covariance approximation of 
two components both having values near zero when the incomplete data 
had zero observations for either one of the corresponding completely 
specified cells. 

In general, the Taylor-series approximate posterior variance was 
a slightly better approximation than the Taylor-series approximate 
posterior covariance, which was usually of values closer to zero. 

Results indicated that the closer a value was to zero, the
harder it was to approximate.

As expected, all approximations generally improved as sample size
increased or percentage of incomplete data decreased. Exceptions were
values near a boundary of the simplex, where, for a sample size of 25,
a number of approximations were perfect. As the sample size increased,
the possibility of a perfect fit lessened.

As p moved from a corner toward the center of the simplex, 
approximations generally improved in terms of all the measures that were 
considered, except for those cases near the P 2 boundaries already
having a perfect or near-perfect fit. 
6.5 Minimizing Risk for Quadratic Loss:

6.5.1 Introduction:

In this section, we report results from determining which of three 
estimators best minimized risk, expected quadratic loss, for specified 
values of p. Two of the estimators were the maximum likelihood estimate 
MLE and the posterior mode PMD. The remaining estimator was the Taylor- 
series approximate posterior mean APM. Except at the end of this intro- 
ductory section, we do not report results from using the exact posterior 
mean EPM because these results were the same as those from using the 
approximate posterior mean. We report APM results instead of EPM results 
because we expect the Taylor-series approximation to be more often used 
in practice. 

As discussed in the introductory chapter, Chapter 1, we were
particularly interested in whether the maximum likelihood estimate was
best for probabilities at the boundaries of the P 2 simplex and the
posterior mean was best otherwise. Therefore, the generators were chosen
to represent one extreme probability p_1=(.01,.01,.98), a probability
near a corner of the simplex, and one probability p_4=(1/3,1/3,1/3) at
the center. The remaining two probabilities p_2=(.10,.10,.80) and
p_3=(.20,.30,.50) lay between the boundary and the center. Hence, if the
maximum likelihood estimate is best for p_1 and the posterior mean, for
p_4, we will be particularly interested in whether p_2 or p_3 or some
probability between them is a crossover point for which estimator best
minimizes risk.

As discussed in Chapter 1, we compare the three estimators by using 
two wrong priors, as well as the correct, original, prior in their 
calculations. Note that the maximum likelihood estimate, not being a
Bayesian estimate, was the same for all three studies. We labeled
these three studies as R0 (robustness study 0), R1 (robustness study 1),
and R2 (robustness study 2).

For the first wrong prior, in robustness study R1, we chose the
uniform prior (1,1,1) because of its common use when one is uncertain
of prior knowledge. The uniform prior gives equal weight to all
components of p. For this prior, the posterior mode equals the maximum
likelihood estimate. For the second wrong prior, in robustness study
R2, we chose 10*[v/10+(.09,.05,-.14)], where v is the original prior.
This prior perturbs the three components of the prior mean p by .09,
.05, and -.14, respectively. Hence, we called it the perturbed prior.
Values of the original-prior mean p versus the wrong-prior means are
given in Figure 6.2.


FIGURE 6.2

PRIOR MEANS FOR THREE ROBUSTNESS STUDIES

        R1                     R0                     R2
  prior mean for         prior mean for         prior mean for
  uniform prior          original prior         perturbed prior

  (1/3, 1/3, 1/3)        (.01, .01, .98)        (.100, .060, .840)
  (1/3, 1/3, 1/3)        (.10, .10, .80)        (.190, .150, .660)
  (1/3, 1/3, 1/3)        (.20, .30, .50)        (.290, .350, .360)
  (1/3, 1/3, 1/3)        (1/3, 1/3, 1/3)        (.423, .383, .193)
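The prior means for the three robustness studies can be reproduced with a short sketch. It assumes, as the expression 10*[v/10+(.09,.05,-.14)] and the tabulated means suggest, that the original Dirichlet prior parameters v sum to 10, so that v/10 is the original prior mean. The function names are hypothetical.

```python
def prior_mean(v):
    """Mean of a Dirichlet distribution with parameter vector v."""
    total = sum(v)
    return tuple(vi / total for vi in v)


def perturbed_prior(v, delta=(0.09, 0.05, -0.14)):
    """Perturbed prior 10*[v/10 + delta] used in robustness study R2."""
    return tuple(10 * (vi / 10 + di) for vi, di in zip(v, delta))


# Assumed original prior for the generator (.01, .01, .98): v = 10 * p.
original = (0.1, 0.1, 9.8)
uniform = (1.0, 1.0, 1.0)

print([round(m, 3) for m in prior_mean(uniform)])                     # R1
print([round(m, 3) for m in prior_mean(original)])                    # R0
print([round(m, 3) for m in prior_mean(perturbed_prior(original))])   # R2
```

Because the perturbations sum to zero, the perturbed parameters still sum to 10, and the perturbed prior mean is simply the original mean shifted by (.09, .05, -.14).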


The situation of having previous data but data that yields the wrong 
prior is more realistically addressed by the perturbed prior in the R2 
study. In this study, we picked a wrong prior that was extreme relative 
to the correct prior. For example, if the original prior mean in
Figure 6.2 is p_1=(.01,.01,.98), then prior data giving a mean of
(.10,.06,.84) is unlikely [but not impossible].

Results of robustness study R0 (original prior used in Bayesian 
estimators) are given in the next section, Section 6.5.2. Those for
robustness study R1 (uniform prior used in Bayesian estimators) are 
given in Section 6.5.3. Results for robustness study R2 (perturbed 
prior used in Bayesian estimators) are given in Section 6.5.4. Section 
6.5.5 summarizes these results for minimizing risk for quadratic loss. 

Before leaving this section, we briefly discuss the mean-squared- 
error (mse) estimates. Risk for quadratic loss is also called mean 
squared error. As described in Section 5.9, we had three estimates of 
mean squared error. These were the regular, control-variate, and
regression estimates. We found in Section 5.9 that the regression mse
estimate had the smallest variance. Nonetheless, for 200 trinomial
simulations, the regression-estimate sample variance usually did not
differ greatly from that of the regular or control-variate estimates.
The differences were nearly always within one order of magnitude. The
main exception was a two-orders-of-magnitude difference between the
control-variate and regression estimates for PMDR0 at p.^ when SS=25.

Recall that the regression estimate is biased; the other two are 
not. However, in almost all cases the biased regression estimate lay 
between the unbiased regular and control-variate estimates. In the few
exceptions, it was close to one of the two unbiased estimates. Hence,
its bias was negligible. Therefore, since the regression estimate had
the smallest variance, we used it as the estimate of mean squared error,
the estimator's risk. 

6.5.2 Original Prior in Bayesian Estimators:

In this subsection, we discuss results from robustness study R0
where we used the original, correct, prior v in the Bayesian estimators.

In Table 6.9 we give values for the regression-estimate mean 
squared error (risk) over 200 trinomial simulations for both replications. 
For all PID and SS variations, the posterior mean has the smallest mean
squared error for p_2, p_3, and p_4; the posterior mode, for p_1.
Therefore, results indicate that when the correct prior is used in the
Bayesian estimators, the posterior mean is the best estimator for all
probabilities except those on a boundary of the P 2 simplex. For these
boundary probabilities, the posterior mode is the best estimator,
although the difference between the posterior mode and the posterior
mean decreases as sample size increases. The maximum likelihood estimate
is always the worst estimator.

To determine significant effects in Table 6.9, we next present re- 
sults from analysis of variance on the natural logarithms of these mean 
squared errors. Table 6.10A shows a huge F value (22,461) for the main 
effect of p, very large F values (2172 and 1654, respectively) for main 
effects of sample size and estimator, and a high F value (92, 6df) for 
the PxEST interaction. Hence, as Snedecor and Cochran (1968, p. 344) and
Steel and Torrie (1960, p. 207) imply, the significant three-factor
interaction PxSSxEST might mean only that there is a minor change in
PxEST as SS varies. Similarly, the large F value for EST relative to that for
the two-factor interaction PIDxEST might mean only that there is a minor 
change in EST as PID varies. 

Plots of PIDxEST and PxESTxSS in Part B of Table 6.10 generally
show this premise to be true. Values for the plots were calculated by
summing over nonpresent factors (including replication) in Table 6.9
after natural logarithms had been taken. The PIDxEST plot shows that,
summed over all factors but PID, APM is the best estimator and MLE, the
worst. As PID increases, all three estimators become worse. The
PxESTxSS plots show that the posterior mean is best for p_2, p_3, and p_4
when SS=25. The posterior mode is best for p_1. However, it does not
differ greatly from the posterior mean. When sample size increases to
50, the posterior mode and posterior mean become approximately equal
at p_1, p_3, and p_4. The maximum likelihood estimate is everywhere the
worst estimator.


To determine how much risk in Table 6.10B is reduced by using the
best estimator, we made a rough translation from log_e(mse) back to mse
in the following way. Let v1 and v2 denote the risk of an estimator
for replications 1 and 2 (r1 and r2), respectively. Then,

$$\log_e(v1) + \log_e(v2) = \log_e(v1 \times v2).$$

Let w1 and w2 denote the corresponding risk of a second estimator.
Then the difference between the summed natural logarithms in the plots
of these two estimators is

$$\log_e(v1 \times v2) - \log_e(w1 \times w2) = \log_e[(v1 \times v2)/(w1 \times w2)] = \log_e[(v1/w1)(v2/w2)]$$

and

$$\exp[\log_e(v1 \times v2) - \log_e(w1 \times w2)] = (v1/w1)(v2/w2).$$

Table 6.9 shows that risk differs little between replications r1 and r2;
i.e., v1 ≈ v2 and w1 ≈ w2. Therefore, we can approximate the ratio of the
risk of one estimator to that of another estimator by the square root
of the last equation; i.e., by

$$\sqrt{\exp[\log_e(v1 \times v2) - \log_e(w1 \times w2)]}.$$

Again note that log_e(v1 x v2) is the value plotted in Table 6.10B for an
estimator.
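The back-translation just described amounts to one line of arithmetic, sketched here in Python. The function name and the four risk values are hypothetical and serve only to show that the square root of the exponentiated difference recovers the geometric mean of the two per-replication risk ratios.

```python
import math


def approx_risk_ratio(summed_log_v, summed_log_w):
    """Approximate risk ratio v/w from the summed natural logs of two replications.

    summed_log_v = log(v1) + log(v2) and summed_log_w = log(w1) + log(w2);
    the result is sqrt((v1/w1) * (v2/w2)).
    """
    return math.sqrt(math.exp(summed_log_v - summed_log_w))


# Hypothetical risks for two estimators over replications r1 and r2.
v1, v2 = 0.0040, 0.0042
w1, w2 = 0.0080, 0.0082

ratio = approx_risk_ratio(math.log(v1) + math.log(v2),
                          math.log(w1) + math.log(w2))
print(round(ratio, 3))
print(round(1 - ratio, 3))   # approximate proportional reduction in risk
```

When v1 and v2 (and w1 and w2) are nearly equal, this geometric mean is close to either per-replication ratio, which is what justifies the approximation.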

Using this basis, then, to roughly translate results from log_e(risk)
to risk, we find that plots in Table 6.10B show that use of the best
estimator reduced risk by about one-fourth (almost one-half at p_2) over
use of the next best estimator and by slightly more than one-half over
use of the worst estimator when the sample size (SS) was 25. The
corresponding reduction in risk when the sample size was 50 was 10% - 15%
(25% - 32% at p_2) and 35% - 40%, respectively.

To study further the mean squared errors at p_1, we broke the mean
squared error into its 200 components corresponding to the individual
trinomial simulations. We then calculated which estimator had the
smallest squared error for each of these simulations. From results of
the last plot, we would expect the proportion at p_1 to be highest for
the posterior mode. However, the proportion of simulations in which the
posterior mean had the smallest squared error was two to four times
higher than that for the posterior mode! This discrepancy indicates
that when the posterior mode is best, it is best by a much greater
amount than when the posterior mean is best.
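
The distinction between having the smaller mean squared error and having the smaller squared error in most simulations can be seen in a small sketch. The numbers below are made up for illustration (five simulations rather than 200) and are not data from the study; the function name is likewise hypothetical.

```python
def mse_and_win_share(squared_errors):
    """squared_errors maps an estimator name to its per-simulation squared errors.

    Returns (mse, win_share): the mean squared error of each estimator and the
    proportion of simulations in which it had the smallest squared error.
    Ties go to the estimator listed first.
    """
    names = list(squared_errors)
    n = len(squared_errors[names[0]])
    mse = {name: sum(squared_errors[name]) / n for name in names}
    wins = {name: 0 for name in names}
    for j in range(n):
        best = min(names, key=lambda name: squared_errors[name][j])
        wins[best] += 1
    win_share = {name: wins[name] / n for name in names}
    return mse, win_share

# Made-up squared errors: "mean" wins narrowly four times, "mode" wins once by a lot.
toy = {
    "mean": [0.010, 0.010, 0.010, 0.010, 0.090],
    "mode": [0.012, 0.012, 0.012, 0.012, 0.001],
}
mse, win_share = mse_and_win_share(toy)
print(mse)
print(win_share)
```

In this toy case the posterior mean wins in four of five simulations, yet the posterior mode has the smaller mean squared error because its single win is by a wide margin.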

Finally, we investigated bias for the first two components for
those estimators having approximately equal risk, where bias was
estimated by

\[ \sum_{j=1}^{200} (\hat{p}_{ij} - p_i)/200 \quad \text{for } i = 1, 2. \]

Results showed that, except for the trimean for p^, all central values
(mean, median, and trimean) for the errors $\hat{p}_{ij} - p_i$ were
smallest for the posterior mean, even at p_1.
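
The three central values of the errors can be computed as in the following Python sketch. It assumes Tukey's definition of the trimean, (Q1 + 2 x median + Q3)/4, with quartiles from the standard library's "inclusive" method; the quartile convention actually used in the study is not stated here, and the function name and input values are hypothetical.

```python
import statistics

def central_values(estimates, true_p):
    """Mean, median, and trimean of the errors (estimate - true value)."""
    errors = [est - true_p for est in estimates]
    # Quartile convention is an assumption; other conventions give slightly
    # different trimeans for small samples.
    q1, q2, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(errors),      # the bias estimate in the text
        "median": statistics.median(errors),
        "trimean": (q1 + 2 * q2 + q3) / 4,
    }

# Hypothetical estimates of a component whose true value is 0.33.
cv = central_values([0.31, 0.32, 0.33, 0.34, 0.35], 0.33)
print(cv)
```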

For each of the first two components, all estimators at p^ had 
approximately one-half negative and one-half positive errors. Except at 
p^, proportions of negative errors were noticeably higher for the poster- 
ior mode than for the posterior mean or maximum likelihood estimate. 
Proportions for the latter two were always close and were often identical. 
As p moved from the center toward the corner of the P_2 simplex or as PID
increased, the proportion of negative errors for each component usually
increased. As sample size increased, proportions moved toward a 50/50 
ratio. 

6.5.3 Uniform Prior in Bayesian Estimators:

In this subsection, we discuss results from robustness study R1 
where we used the uniform prior (1,1,1), instead of the correct prior 
v, in the Bayesian estimators. For this uniform prior, the posterior 
mode equals the maximum likelihood estimate. Hence, we have only two 
estimators for this robustness study. 

In Table 6.9 we give values for the regression-estimate mean
squared error (risk) over 200 trinomial simulations for both
replications. For p_3 and p_4, for both levels of sample size and both
levels of percentage of incomplete data, the posterior mean has the
smaller mean squared error, although differences tend to be small. For
p_1, the posterior mode (maximum likelihood estimate) has the smaller
mean squared error. For p_2, the posterior mode (mle) has the smaller
mean squared error for PID=15 and the posterior mean, for PID=40.
However, the difference at p_2 between the two estimators is very small.

Therefore, results indicate that when a uniform prior is used in 
the Bayesian estimators, the posterior mode (mle) is the better 
estimator for probabilities at or near a boundary of the P 2 simplex. 

The posterior mean is the better estimator for all other probabilities. 

To determine significant effects in Table 6.9, we next present in
Table 6.11 results from an analysis of variance on the natural logarithms
of these mean squared errors. As for the original prior, F values for
PxEST and SS were so large relative to those for PxSSxEST that we
expected the significance for the latter to reflect mainly a variation
in PxEST for the two levels of sample size. The plot in Part B of
Table 6.11 shows this to be true. Estimators have larger negative
log_e(mse) at SS=50, but curves at the two sample sizes are similar.
The plot also shows that the difference between the two estimators at
p_1 is large relative to the difference at the other three values of p.
Finally, as expected, differences between estimators decrease as sample
size increases.

Using the rough translation given in Section 6.5.3 for log_e(mse),
we find that plots in Table 6.11 show that the largest reduction in risk
occurred at the corner probability p_1. At p_1 the risk of the posterior
mean was almost six times larger than that of the posterior mode (mle)
when the sample size was 25 and almost four times larger when the sample
size was 50. At p_2, however, the risks of the estimators were almost
equal. For p_3 and p_4, the risk of the posterior mean was about 25%
smaller than that of the posterior mode (mle) when the sample size was
25 and 15% smaller when the sample size was 50.

As noted in Section 6.4.2, it is possible for an estimator to have 
smaller mse but not have smaller squared error for most of the 200 tri- 
nomial simulations. Hence, we next studied several estimator character- 
istics for each of the 200 trials. Since p 2 seemed to be a crossover 
probability for which estimator was better, results for p 2 were of spe- 
cial interest. They showed that each estimator was better approximately 
50% of the time in terms of squared error. However, in terms of per- 
centage relative difference, the posterior mean was the better estimator 
for two-thirds of the trials. 

An investigation of the estimated bias found that central values for 
the individual errors were smaller for the posterior mode (mle) than for 
the posterior mean at p 2 . Further, in all cases, the posterior mode was 
slightly closer to a 50/50 ratio of positive errors to negative errors 
than was the posterior mean. The posterior mean had a higher proportion 
of positive errors. 

6.5.4 Perturbed Prior in Bayesian Estimators:

In this subsection, we discuss results from robustness study R2
where we used the perturbed prior 10x[v/10+(.09,.05,-.14)], instead of
the correct prior v, in the Bayesian estimators.
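
As a worked illustration of the perturbation, the Python sketch below applies the shift (.09, .05, -.14), the form given for this prior in Section 7.1, to a prior whose parameters sum to 10. The function name is hypothetical. Because the shift sums to zero, the perturbed parameters still sum to 10; only the prior mean moves.

```python
def perturbed_prior(v, shift=(0.09, 0.05, -0.14), scale=10.0):
    """Return scale * (v/scale + shift), component by component.

    v/scale is the prior mean when the parameters of v sum to `scale`.
    """
    return tuple(scale * (vi / scale + si) for vi, si in zip(v, shift))

# Prior v_1 = (.1, .1, 9.8), whose mean is the corner point (.01, .01, .98).
out = perturbed_prior((0.1, 0.1, 9.8))
print(tuple(round(x, 3) for x in out))
```

For this prior the perturbed mean is (.10, .06, .84) instead of (.01, .01, .98).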

Table 6.9 gives values for the regression-estimate mean squared
error (risk) over 200 trinomial simulations for both replications.
Results are similar to those for the uniform prior. The posterior
mean is best for p_3 and p_4. The posterior mode is best for p_1 and
usually best for p_2. The maximum likelihood estimate is the worst
estimator at p_2, p_3, and p_4; the posterior mean, at p_1.

Therefore, results indicate that when a very wrong prior is used, 
the posterior mode is the best estimator for probabilities at or near a 
boundary. However, the posterior mean will still be the best estimator 
for probabilities away from the boundary. 

To determine significant effects among variables in Table 6.9, we 
next performed an analysis of variance on the natural logarithms of the 
mean squared errors. Significant F values are given in Table 6.12. 
Plots of the significant PIDxEST and PxESTxSS interactions are given in 
Part B of Table 6.12. The PIDxEST plot shows that, when summed over p, 
SS, and replication, the posterior mode PMD is the best estimator, 
followed by APM and MLE. [However, when this analysis was done on the
original mean squared errors rather than on log_e(mse), the posterior
mean, not the posterior mode, was best.] As expected, all estimators
worsen as the percentage of incomplete data increases. However, the
difference between estimators is almost constant as PID changes.

The plot of PxESTxSS shows that, when summed over PID and
replication, the posterior mode is best for p_1 and p_2, the posterior
mean is best for p_4, and the posterior mode and posterior mean are
equally best for p_3. Except at p_1, the maximum likelihood estimate is
the worst estimator. Estimators improve and differences between
estimators decrease as sample size increases.
By using the log_e(mse) translation given in Section 6.5.3, we find
that plots in Table 6.12 show that, as for the uniform prior, the largest
reduction in risk occurred at the corner probability p_1. At p_1 the risk
of the posterior mean was over four times larger than that of the
posterior mode when the sample size was 25 and almost three times larger
when the sample size was 50. The risk of the maximum likelihood estimate
was twice that of the posterior mode when the sample size was 25 and
about 64% larger when the sample size was 50. At p_2, the risks of the
posterior mode and posterior mean were almost equal. The risk of the
posterior mode was close to one-half that of the maximum likelihood
estimate when the sample size was 25 and about 72% that of the maximum
likelihood estimate when the sample size was 50. At p_3, the risk of the
posterior mean was only slightly smaller than that of the posterior mode
but was about one-half that of the maximum likelihood estimate when the
sample size was 25 and about 70% that of the maximum likelihood estimate
when the sample size was 50. At p_4, the risk was reduced about 20% by
using the posterior mean instead of the posterior mode when the sample
size was 25 and about 12% when the sample size was 50. The relationship
between the risk for the posterior mean and maximum likelihood estimate
was the same as it was for p_3.

As for the original and uniform priors, we next examined several
additional properties of the estimators. The most important result was
that, when MLE or PMD had smallest risk, it was generally because, when
it had smallest squared error for one of the 200 trinomial simulations,
the difference between it and APM's squared error was much larger than
the difference when APM was best. This larger difference usually arose
because the APM estimate, which has a nonzero prior, is never zero.
6.5.5 Conclusions:

We now summarize results from the three studies for minimizing risk,
and we recommend an operating rule. In this section, we are interested 
in choosing which of the three estimators is best for minimizing risk 
(expected quadratic loss). As anticipated from the introductory discus- 
sion in Section 6.5.1, this minimizing estimator was a function of the 
probability (or probability mean in the Bayesian framework) that was 
being estimated. 

Summary results from these three studies are given in Tables 6.13
and 6.14. In Table 6.13 we give the ratio of the estimated mean squared
errors for the posterior mean to those of the posterior mode and to those
of the maximum likelihood estimate. A ratio of less than 1 means that
the posterior mean has the smaller mean squared error of the two
estimators being compared. In Table 6.14, we condense results from
Table 6.13 and give the estimator having the smallest mean squared error
(risk).
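
The step from the ratios of Table 6.13 to the entries of Table 6.14 can be sketched in Python as follows. The function names are hypothetical; the ratios used in the example are entries of Table 6.13 (Robustness Set 0 at p_2 and Robustness Set 2 at p_1, both for SS=25, PID=15, replication r1).

```python
def best_estimator(ratio_apm_pmd, ratio_apm_mle=None):
    """Pick the estimator with the smallest MSE from Table 6.13 style ratios.

    ratio_apm_pmd = MSE(APM)/MSE(PMD); ratio_apm_mle = MSE(APM)/MSE(MLE).
    MSEs are expressed relative to MSE(APM) = 1.
    """
    relative_mse = {"apm": 1.0, "pmd": 1.0 / ratio_apm_pmd}
    if ratio_apm_mle is not None:
        relative_mse["mle"] = 1.0 / ratio_apm_mle
    return min(relative_mse, key=relative_mse.get)

def risk_reduction(ratio):
    """Fractional reduction in risk from using the better of two estimators."""
    return 1.0 - min(ratio, 1.0 / ratio)

print(best_estimator(0.58, 0.48))     # apm
print(best_estimator(4.1, 2.0))       # pmd
print(round(risk_reduction(0.80), 2))  # 0.2
```

A ratio of .80 thus corresponds to a 20% reduction in risk from using the posterior mean, and a ratio of 5.5 to a reduction of about 82% from using the posterior mode.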

If we use the correct prior, results indicate that the posterior 
mean is best for all values of p except those very near a boundary of 
the simplex. Even very near a boundary, results for the posterior 
mean differed little from those for the best estimator, the posterior 
mode, especially for a sample size of 50. [See Plot 6.10B and Table 
6.13.] When the sample size was 25, risk was usually reduced by one- 
fourth if the best estimator was used instead of the next best estimator 
and by one-half if the best estimator was used instead of the worst 
estimator, the maximum likelihood estimate. These reductions decreased 
to about 12% and 38%, respectively, when the sample size doubled. 
If we do not have, or do not want to use, past knowledge for estimating a
prior and instead use a uniform prior, in which case the posterior mode
equals the maximum likelihood estimate, then results indicate that the
posterior mode (mle) is best for points very near a boundary and, for
PID=15, those near a boundary. The posterior mean is best everywhere
else. The crossover point is approximately p_2, where estimated mean
squared errors for the posterior mean and posterior mode are almost equal.
[See Plot 6.11B and summary tables, Tables 6.13 and 6.14.] In this
robustness study, the largest reduction in risk occurred at the corner
probability p_1, where risk was reduced by five-sixths if the posterior
mode (mle) was used instead of the posterior mean when the sample size
was 25. When the sample size was 50, the reduction was three-fourths.
For p_3 and p_4, risk was reduced about one-fourth by using the posterior
mean instead of the posterior mode (mle) when the sample size was 25;
by one-seventh, when the sample size was 50.

For an estimate of the prior that is very poor, conclusions are
similar to those for the uniform prior. The posterior mode is best at
or near a boundary; the posterior mean, elsewhere. The main difference
is that the crossover point is a little closer toward the center of P_2.
[See Plot 6.12B and summary tables, Tables 6.13 and 6.14. In particular,
observe how similar curves in Plot 6.12B are to those in Plot 6.10B.]

In this robustness study also, the largest reduction in risk occurred
at the corner probability p_1. At p_1, risk was reduced by three-fourths
when the posterior mode was used instead of the posterior mean when the
sample size was 25. When the sample size was 50, the reduction was
two-thirds. At the center of P_2, risk was reduced by about 20% when
the posterior mean was used instead of the posterior mode when the sample
size was 25; about 12% when the sample size was 50. Otherwise, at p_2 and
p_3, the risk of the posterior mean and posterior mode differed little.

Use of the best estimator instead of the maximum likelihood estimate 
usually reduced risk by one-half when the sample size was 25 and one- 
third when the sample size was 50. 

Recall the centrality measure C(p) that we defined in equation (5.2).
This norm is a measure of the distance a probability is from the center
of the simplex. For the four values, p_1=(.01,.01,.98), p_2=(.10,.10,.80),
p_3=(.20,.30,.50), and p_4=(1/3,1/3,1/3) of p in the simulation study,
centrality measures were 1.88, .98, .14, and 0, respectively. In general,
probabilities nearest a boundary have a centrality measure larger than 1.

When we used the uniform prior or the badly estimated prior in the
robustness studies, the crossover point for which estimator was best lay
between p_2 and p_3. Between p_2 and the crossover point, however, there
was little difference between results for the posterior mode, the best
estimator, and those for the posterior mean. Further, for priors that
are not as badly estimated as were those in the second robustness study,
we expect the crossover point to be closer to p_2 or, based on Plot 6.10B
for the correct prior, possibly between p_1 and p_2.

Since p_2 has a centrality measure of .98, we recommend, as an
operating rule, use of the posterior mean if the centrality measure of p
is less than 1 and the posterior mode, otherwise. This operating rule is
a function of p and in practice, of course, we do not know p. Hence, we
cannot calculate the exact centrality measure. However, for any estimate
$\hat{v}$ of the prior, we can approximate the centrality measure by

\[ C(p) = C\Bigl(\hat{v} \Big/ \sum_{j=1}^{k+1} \hat{v}_j\Bigr). \]

In those cases having no estimate of the prior, we could
use a uniform prior and, thus, approximate C(p) by 0.
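
The operating rule can be sketched in Python as follows. Equation (5.2) is not reproduced in this chapter, so the form of C(p) used below, (k+1) times the squared Euclidean distance from the center of the simplex, is an assumption: it was chosen only because it reproduces the four centrality values 1.88, .98, .14, and 0 quoted above. The function names are hypothetical.

```python
def centrality(p):
    """Assumed form of C(p): (k+1) * sum_i (p_i - 1/(k+1))**2.

    This is inferred from the reported values, not taken from equation (5.2).
    """
    cells = len(p)
    return cells * sum((pi - 1.0 / cells) ** 2 for pi in p)

def centrality_from_prior(v):
    """Approximate C(p) by the centrality of the prior mean v / sum(v)."""
    total = sum(v)
    return centrality([vi / total for vi in v])

def recommended_estimator(v):
    """Operating rule: posterior mean if centrality < 1, else posterior mode."""
    return "posterior mean" if centrality_from_prior(v) < 1 else "posterior mode"

for p in [(0.01, 0.01, 0.98), (0.10, 0.10, 0.80), (0.20, 0.30, 0.50), (1/3, 1/3, 1/3)]:
    print(round(centrality(p), 2))

print(recommended_estimator((0.1, 0.1, 9.8)))  # posterior mode
print(recommended_estimator((1, 1, 1)))        # posterior mean
```

With a uniform prior the prior mean is the center of the simplex, so the approximate centrality is 0 and the rule selects the posterior mean.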

Note that the maximum likelihood estimate was everywhere the worst 
estimator when the correct prior was used in the Bayesian estimates. 

Even when a very poor estimate of the correct prior was used [robustness 
study R2], the maximum likelihood estimate was the worst estimator every- 
where except very near a boundary where it was second best. 

As sample size increased, the difference between the estimators de- 
creased. A sample size of 50 was large enough for some of the estimators 
in some cases to be approximately equal. As the percentage of incomplete 
data increased, all estimators worsened. However, the difference between 
estimators did not significantly change. 
6.6 Summary:

In this chapter we gave results of Design 1 in the simulation study.
In the first half we discussed Taylor-series approximations for elements
of the posterior mean and covariance matrices. These approximations were
needed for the second half of the study. In the second half, we reported
which of the posterior mean, posterior mode, and maximum likelihood
estimate best minimized risk for quadratic loss at specified values of the
population probability (or probability mean in the Bayesian framework).
Conclusions and recommendations were given at the end of each of these
discussions.

Briefly, the Taylor-series approximations were excellent except for 
some of those cases simultaneously having inconsistent data, zero obser- 
vations for two of the three completely specified cells, and high percen- 
tage (40% - 60%) of incomplete data. Even in these rare cases, the 
approximations are probably satisfactory considering the inherent uncer- 
tainty associated with estimating nonzero probabilities from zero data. 

In nearly all cases, the approximations were accurate to at least two 
significant figures. The approximation for elements of the posterior 
mean vector was even better. In most cases, it was accurate to at least 
four significant figures. 

The risk study indicated that the posterior mean is the best
estimator for all values of the probability p except those very near a
boundary of the simplex if we use the correct prior in the Bayesian
estimates. The posterior mode is best at a boundary. However, it does not
differ much from the posterior mean. If, however, we use a uniform prior
or a bad estimate of the correct prior, then the posterior mode is the
best estimator for values at or near a boundary of the simplex; the
posterior mean, elsewhere. By using the best estimator, risk was usually
reduced by one-fourth over that of the next best estimator and by
one-half over that of the worst estimator (nearly always the maximum
likelihood estimate) when the sample size was 25. Corresponding reductions
when the sample size was 50 were one-eighth and three-eighths,
respectively. At a corner p_1=(.01,.01,.98), however, the reduction was
much larger when an incorrect prior was used in the Bayesian estimators;
the risk was reduced by as much as five-sixths when the posterior mode was
used instead of the posterior mean.

In the last section we gave the following operating rule for
determining which estimator to use in practice: use the posterior mean if
the centrality measure calculated from your estimate of the prior is less
than 1; otherwise, use the posterior mode.





PROPORTION OF 200 TRINOMIAL SIMULATIONS FOR WHICH PERCENT ABSOLUTE RELATIVE DIFFERENCE 1 FOR EACH COMPONENT IS LESS THAN SPECIFIED AMOUNTS. 

EPM DIFFERENCE. DESIGN 1. 
TABLE 6.5

BIAS AND MSE RATIOS FOR EPM COMPARISONS. DESIGN 1.

Sample Size                   SS=25                                   SS=50
% Inc. Data        PID=15              PID=40              PID=15              PID=40
Replic. No.      r1        r2        r1        r2        r1        r2        r1        r2
p*    Approx.

A. RATIO OF BIAS(APM) TO BIAS(PMD) AND BIAS(MLE) FOR EPM COMPARISONS

p_1   PMD   -.13 -2   -.21 -2   -.87 -2    .44 -2   -.24 -2    .43 -3   -.11 -1   -.42 -2
      MLE   -.69 -1    .45 -1   -.52 -1    .89 -1    .81 -1   -.10 -1    .14  0    .55 -1
p_2   PMD   -.33 -3   -.25 -3   -.25 -2   -.18 -2   -.26 -3   -.18 -3   -.37 -2   -.30 -2
      MLE   -.41 -2    .19 -1    .14 -1   -.14  0    .12  0   -.60 -2   -.12  0   -.38 -1
p_3   PMD   -.54 -3   -.10 -2   -.35 -2   -.48 -2   -.57 -3   -.53 -3   -.81 -2   -.83 -2
      MLE    .35 -2   -.22 -1   -.24 -1    .32 -1   -.58 -2    .92 -2    .16  1    .45 -1
p_4   PMD   -.48 -2    .70 -2   -.12 -1    .89 -1    .17 -2    .41 -2    .47 -1   -.18 -1
      MLE   -.11 -2    .17 -2   -.30 -2    .19 -1    .44 -3    .11 -2    .13 -1   -.45 -2

B. RATIO OF MSE(APM) TO MSE(PMD) AND MSE(MLE) FOR EPM COMPARISONS

p_1   PMD    .28 -3    .45 -4    .20 -2    .31 -2    .93 -4    .15 -3    .34 -2    .12 -2
      MLE    .13 -2    .23 -3    .68 -2    .71 -2    .17 -2    .24 -2    .23 -1    .76 -2
p_2   PMD    .78 -6    .24 -5    .64 -4    .58 -4    .15 -5    .12 -5    .50 -4    .84 -4
      MLE    .18 -5    .59 -5    .10 -3    .92 -4    .71 -5    .48 -5    .13 -3    .24 -3
p_3   PMD    .64 -5    .12 -4    .22 -3    .20 -3    .87 -5    .53 -5    .27 -3    .27 -3
      MLE    .15 -5    .27 -5    .40 -4    .37 -4    .32 -5    .22 -5    .84 -4    .87 -4
p_4   PMD    .36 -4    .31 -4    .67 -3    .69 -3    .37 -4    .32 -4    .11 -2    .92 -3
      MLE    .18 -5    .16 -5    .29 -4    .31 -4    .25 -5    .21 -5    .68 -4    .57 -4

*Dirichlet probability (expected value of the Dirichlet distribution of p given prior parameters
v_1, v_2, v_3, and v_4, respectively)
TABLE 6.6A

ANALYSIS OF VARIANCE FOR ESTIMATED EPM BIAS

SOURCE          D.F.    SUM OF SQ.     MEAN SQ.            F
P                  3    .658393 -3    .219464 -3     201.924 ***
SS                 1    .539909 -4    .539909 -4      49.676 ***
PID                1    .794404 -5    .794404 -5       7.309 ***
EST                2    .211893 -2    .105946 -2     974.787 ***
PxSS               3    .686198 -4    .228733 -4      21.045 ***
PxPID              3    .634564 -5    .211521 -5       1.946
PxEST              6    .104191 -2    .173652 -3     159.773 ***
SSxPID             1    .219305 -5    .219305 -5       2.018
SSxEST             2    .879171 -4    .439586 -4      40.445 ***
PIDxEST            2    .642627 -5    .321314 -5       2.956 *
PxSSxPID           3    .452496 -6    .150832 -6        .139
PxSSxEST           6    .114745 -3    .191242 -4      17.596 ***
PxPIDxEST          6    .125370 -4    .208950 -5       1.922 *
SSxPIDxEST         2    .135775 -5    .678876 -6        .625
PxSSxPIDxEST       6    .621923 -6    .103654 -6        .095
ERROR             48    .521696 -4    .108687 -5
TOTAL             95    .423455 -2

  * Significant at 10% level.
*** Significant at 1% level.

Note that exponential notation is used for the third and fourth 
columns; for example, .00423455 is written as .423455 -2. 
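
The mantissa-and-exponent notation of these tables is easy to convert, and the F column can be checked as the ratio of a mean square to the error mean square. The Python sketch below is illustrative; the function name is hypothetical, and the two mean squares are the P and ERROR entries of Table 6.6A.

```python
def parse_mantissa_exponent(text):
    """Convert table notation such as '.423455 -2' to a float (.00423455)."""
    mantissa, exponent = text.split()
    return float(mantissa) * 10 ** int(exponent)

total_ss = parse_mantissa_exponent(".423455 -2")

# F for factor P in Table 6.6A: mean square for P over the error mean square.
ms_p = parse_mantissa_exponent(".219464 -3")
ms_error = parse_mantissa_exponent(".108687 -5")
f_p = ms_p / ms_error
print(total_ss, round(f_p, 2))
```

The recomputed F agrees with the tabled 201.924 up to rounding of the printed mean squares.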
TABLE 6.7A

ANALYSIS OF VARIANCE FOR NATURAL LOGARITHMS OF ESTIMATED EPM
MEAN SQUARED ERROR

SOURCE          D.F.    SUM OF SQ.     MEAN SQ.            F
P                  3    .127603  2    .425343  1      52.797 ***
SS                 1    .354935  2    .354935  2     440.575 ***
PID                1    .569041  2    .569041  2     706.342 ***
EST                2    .199362  4    .996812  3  12,373.269 ***
PxSS               3    .191251  1    .637503  0       7.913 ***
PxPID              3    .886197  0    .295399  0       3.667 **
PxEST              6    .121670  3    .202784  2     251.713 ***
SSxPID             1    .162900  0    .162900  0       2.022
SSxEST             2    .219450  1    .109725  1      13.620 ***
PIDxEST            2    .540537  2    .270269  2     335.480 ***
PxSSxPID           3    .242087  0    .806956 -1       1.002
PxSSxEST           6    .317632  0    .529387 -1        .657
PxPIDxEST          6    .984097  0    .164016  0       2.036 *
SSxPIDxEST         2    .287500 -1    .143750 -1        .178
PxSSxPIDxEST       6    .228965  0    .381608 -1        .474
ERROR             48    .386696  1    .805617 -1
TOTAL             95    .228533  4

  * Significant at 10% level.
 ** Significant at 5% level.
*** Significant at 1% level.







PROPORTION OF 200 TRINOMIAL SIMULATIONS IN WHICH PERCENT ABSOLUTE RELATIVE DIFFERENCE BETWEEN TAYLOR-SERIES APPROXIMATION (T.S. APC) AND 

EXACT (EPC) POSTERIOR COVARIANCES IS LESS THAN SPECIFIED AMOUNTS. DESIGN 1. 
|APC_IJ - EPC_IJ|/EPC_IJ x 100 for IJ = 11, 12, and 22; APC denoting approximated posterior covariance; and EPC denoting exact posterior
covariance; note that EPC_IJ was never zero in Design 1. C_IJ is the I,Jth element of the posterior covariance matrix.



MEAN SQUARED ERRORS FOR RISK STUDY. DESIGN 1. 
Note that values are given in scientific notation; e.g., .00080 is written as .80-3. Values in parentheses are standard errors.

For PID=15 and 40, mean squared error is estimated by the regression estimate [see Section 5.9].

‘Values were not calculated (see main text) 
TABLE 6.10A

ANALYSIS OF VARIANCE FOR NATURAL LOGARITHMS OF ESTIMATED RISK
FOR ROBUSTNESS SET 0

SOURCE          D.F.    SUM OF SQ.     MEAN SQ.            F
P                  3    .147185  3    .490617  2  22,460.893 ***
SS                 1    .474330  1    .474330  1   2,171.526 ***
PID                1    .645378  0    .645378  0     295.460 ***
EST                2    .722603  1    .361302  1   1,654.071 ***
PxSS               3    .697938 -1    .232646 -1      10.651 ***
PxPID              3    .717094 -1    .239031 -1      10.943 ***
PxEST              6    .120826  1    .201377  0      92.192 ***
SSxPID             1    .535130 -1    .535130 -1      24.499 ***
SSxEST             2    .628491  0    .314245  0     143.864 ***
PIDxEST            2    .768419 -1    .384210 -1      17.589 ***
PxSSxPID           3    .883219 -2    .294406 -2       1.348
PxSSxEST           6    .155608  0    .259346 -1      11.873 ***
PxPIDxEST          6    .824202 -2    .137367 -2        .629
SSxPIDxEST         2    .433219 -2    .216610 -2        .992
PxSSxPIDxEST       6    .127261 -2    .212102 -3        .097
ERROR             48    .104847  0    .218432 -2
TOTAL             95    .162192  3

*** Significant at 1% level.




Values are sums over nonpresent factors, including replication. 
TABLE 6.11A

ANALYSIS OF VARIANCE FOR NATURAL LOGARITHMS OF ESTIMATED RISK
FOR ROBUSTNESS SET 1

SOURCE          D.F.    SUM OF SQ.     MEAN SQ.            F
P                  3    .427955  2    .142652  2    4967.065 ***
SS                 1    .779571  1    .779571  1    2714.429 ***
PID                1    .766894  0    .766894  0     267.029 ***
EST                1    .129177  1    .129177  1     449.789 ***
PxSS               3    .196884  0    .656280 -1      22.851 ***
PxPID              3    .115092 -1    .383639 -2       1.336
PxEST              3    .866720  1    .288907  1    1005.959 ***
SSxPID             1    .107157 -1    .107157 -1       3.731 *
SSxEST             1    .122873 -1    .122873 -1       4.278 **
PIDxEST            1    .570937 -2    .570937 -2       1.988
PxSSxPID           3    .802564 -2    .267521 -2        .931
PxSSxEST           3    .244437  0    .814791 -1      28.371 ***
PxPIDxEST          3    .131641 -1    .438803 -2       1.528
SSxPIDxEST         1    .326823 -3    .326823 -3        .114
PxSSxPIDxEST       3    .523357 -2    .174452 -2        .607
ERROR             32    .919024 -1    .287195 -2
TOTAL             63    .619173  2

  * Significant at 10% level.
 ** Significant at 5% level.
*** Significant at 1% level.
TABLE 6.12A

ANALYSIS OF VARIANCE FOR NATURAL LOGARITHMS OF ESTIMATED RISK
FOR ROBUSTNESS SET 2

SOURCE          D.F.    SUM OF SQ.     MEAN SQ.            F
P                  3    .106983  3    .356610  2   13981.332 ***
SS                 1    .697137  1    .697137  1    2733.214 ***
PID                1    .755904  0    .755904  0     296.361 ***
EST                2    .320158  1    .160079  1     627.610 ***
PxSS               3    .577776 -1    .192592 -1       7.551 ***
PxPID              3    .266412 -1    .888039 -2       3.482 **
PxEST              6    .613926  1    .102321  1     401.162 ***
SSxPID             1    .422372 -1    .422372 -1      16.560 ***
SSxEST             2    .249801  0    .124901  0      48.969 ***
PIDxEST            2    .449282 -1    .224641 -1       8.807 ***
PxSSxPID           3    .118694 -1    .395645 -2       1.551
PxSSxEST           6    .198784  0    .331307 -1      12.989 ***
PxPIDxEST          6    .121991 -1    .203318 -2        .797
SSxPIDxEST         2    .569427 -2    .284714 -2       1.116
PxSSxPIDxEST       6    .943908 -2    .157318 -2        .617
ERROR             48    .122429  0    .255061 -2
TOTAL             95    .124833  3

 ** Significant at 5% level.
*** Significant at 1% level.

Values are sums over nonpresent factors, including replication.



TABLE 6.13

RATIO OF MSE(APM) TO MSE(PMD) AND MSE(MLE) FOR QUADRATIC-LOSS COMPARISONS.

Sample Size                 SS=25                               SS=50
% Inc. Data        PID=15            PID=40            PID=15            PID=40
Replic. No.      r1       r2       r1       r2       r1       r2       r1       r2
Dir. Prob.^  Estimator

A. ROBUSTNESS SET 0 (ORIGINAL PRIOR USED IN BAYESIAN ESTIMATORS)

p_1   PMD     .14  1   .14  1   .14  1   .13  1   .10  1   .12  1   .11  1   .11  1
      MLE     .47  0   .48  0   .43  0   .41  0   .67  0   .66  0   .60  0   .60  0
p_2   PMD     .58  0   .60  0   .52  0   .48  0   .78  0   .73  0   .73  0   .65  0
      MLE     .48  0   .48  0   .39  0   .39  0   .67  0   .67  0   .61  0   .60  0
p_3   PMD     .80  0   .80  0   .75  0   .77  0   .89  0   .87  0   .84  0   .85  0
      MLE     .47  0   .47  0   .39  0   .40  0   .66  0   .66  0   .60  0   .60  0
p_4   PMD     .82  0   .82  0   .79  0   .79  0   .89  0   .39  0   .87  0   .87  0
      MLE     .47  0   .47  0   .40  0   .40  0   .67  0   .67  0   .59  0   .60  0

B. ROBUSTNESS SET 1 (UNIFORM PRIOR USED IN BAYESIAN ESTIMATORS)

p_1   PMD*    .55  1   .56  1   .64  1   .63  1   .37  1   .38  1   .36  1   .38  1
p_2   PMD     .10  1   .11  1   .98  0   .95  0   .11  1   .10  1   .10  1   .92  0
p_3   PMD     .79  0   .80  0   .73  0   .76  0   .90  0   .88  0   .85  0   .36  0
p_4   PMD     .77  0   .77  0   .72  0   .72  0   .88  0   .88  0   .84  0   .85  0

C. ROBUSTNESS SET 2 (PERTURBED PRIOR USED IN BAYESIAN ESTIMATORS)

p_1   PMD     .41  1   .41  1   .50  1   .44  1   .28  1   .31  1   .25  1   .28  1
      MLE     .20  1   .21  1   .21  1   .21  1   .18  1   .19  1   .17  1   .18  1
p_2   PMD     .11  1   .12  1   .11  1   .11  1   .11  1   .10  1   .10  1   .96  0
      MLE     .64  0   .69  0   .55  0   .55  0   .82  0   .78  0   .72  0   .66  0
p_3   PMD     .92  0   .92  0   .90  0   .92  0   .95  0   .94  0   .92  0   .94  0
      MLE     .56  0   .59  0   .49  0   .52  0   .76  0   .71  0   .65  0   .68  0
p_4   PMD     .82  0   .82  0   .79  0   .79  0   .89  0   .89  0   .87  0   .87  0
      MLE     .58  0   .58  0   .52  0   .53  0   .76  0   .72  0   .68  0   .70  0

*For uniform prior, PMD=MLE
^Dirichlet probability
TABLE 6.14

ESTIMATOR HAVING SMALLEST AVERAGE ESTIMATED MEAN SQUARED ERROR
FOR QUADRATIC-LOSS COMPARISON.

Sample Size              SS=25                     SS=50
% Inc. Data        PID=15      PID=40        PID=15      PID=40
Replic. No.       r1    r2    r1    r2      r1    r2    r1    r2
Dir. Prob.

A. ROBUSTNESS SET 0 (ORIGINAL PRIOR IN ESTIMATORS)

p_1               pmd   pmd   pmd   pmd     pmd   pmd   pmd   pmd
p_2               apm   apm   apm   apm     apm   apm   apm   apm
p_3               apm   apm   apm   apm     apm   apm   apm   apm
p_4               apm   apm   apm   apm     apm   apm   apm   apm

B. ROBUSTNESS SET 1 (UNIFORM PRIOR IN ESTIMATORS)

p_1 [1]           pmd   pmd   pmd   pmd     pmd   pmd   pmd   pmd
p_2 [2]           pmd   pmd   apm   apm     pmd   pmd   apm   apm
p_3               apm   apm   apm   apm     apm   apm   apm   apm
p_4               apm   apm   apm   apm     apm   apm   apm   apm

C. ROBUSTNESS SET 2 (PERTURBED PRIOR IN ESTIMATORS)

p_1               pmd   pmd   pmd   pmd     pmd   pmd   pmd   pmd
p_2 [2]           pmd   pmd   pmd   pmd     pmd   pmd   pmd   apm
p_3               apm   apm   apm   apm     apm   apm   apm   apm
p_4               apm   apm   apm   apm     apm   apm   apm   apm

[1] pmd = mle for uniform prior
[2] pmd and apm are nearly equal for all conditions for p_2 for
    Robustness Sets 1 and 2. Recall Table 6.13.



CHAPTER 7 


RESULTS OF DESIGN 2 


7.1 Introduction:

In this chapter, we report results from Design 2. We want to know 
whether risk results from Design 1 depend on the very special choice 
used there for the trinomial generator probabilities. Recall that each 
of the four trinomial generators was the mean of a prior Dirichlet
distribution. This chapter reports what happens when, instead, we 
choose probabilities randomly generated from the Dirichlet distribution 
for these trinomial generators. [See Figures 5.1 - 5.3 for a comparison 
of Design 2 with Design 1.] 

Note that, except for a brief discussion in the next section, we do 
not report work on the Taylor-series approximations. Results of Design 
1 show that risk conclusions depend on the value of the generator p. 
However, the accuracy of the Taylor-series approximations, although 
depending slightly on the value of the generator p, was good for all 
values of p. Rare exceptions occurred at some of those boundary values 
that gave empty cells for the completely specified data when the percen- 
tage of incomplete data was high. Although some of the calculations 
discussed in Section 6.4 for Design 1 were repeated for Design 2, 
results were identical to those already reported. 

Other than the generator probabilities p, factors in Design 2 were 
the same as those in Design 1. There were four values of the prior 
parameter v: v_1=(.1,.1,9.8), v_2=(1,1,8), v_3=(2,3,5), and
v_4=(10/3,10/3,10/3). Sample sizes were SS=25 and SS=50. The percentage
of incomplete data PID varied around PID=15 and PID=40. The three
estimators were the posterior mean (approximated by the Taylor-series
expansion), the posterior mode, and the maximum likelihood estimate.

As in Design 1, we had three robustness studies, one each for use in 
the Bayesian estimates of the original prior v, the uniform prior (1,1,1), 
and the perturbed prior 10x[v/10+(.09,.05,-.14)]. Note, again, that the
maximum likelihood estimate was the same for all three studies. 

Recall from Section 5.6 that cost constraints limited to 10 the
number of Dirichlet generations of p given each of the four values of v.
Values of these probabilities, generated by the procedures described in
Section 5.7.2, are given in Table 7.1. As expected, the generated values
varied around the means (.01,.01,.98), (.10,.10,.80), (.20,.30,.50), and
(.33,.33,.33) of the prior distribution of p given v_1, v_2, v_3, and
v_4, respectively. Table 7.1 also gives the centrality measure C(p) for
each generated value of p. In Design 1, this centrality measure became
the basis for deciding which estimator to use for minimizing risk. Recall
from Table 5.1 that centrality measures for the prior means of the
distribution of p given the four values of v are 1.88, .98, .14, and 0,
respectively. Note, then, in Table 7.1 that centrality measures for v_1
ranged from 1.39 to 2.00 (the highest possible value). Those for v_2
ranged from .06 to 1.94; for v_3, from .05 to 1.05; and, for v_4, from 0
to .38. Centrality measures for the prior means of the distribution of p
given the four values of the perturbed prior 10x[v/10+(.09,.05,-.14)]
are 1.16, .48, .01, and .09. [Recall Figure 6.2 for perturbed-prior
means.]
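
The centrality values quoted above can be checked with a short sketch. It assumes that C(p) is the sum of squared pairwise differences of the components of p, the form that appears in the worked example of Section 7.2 and in Figure 7.2. The function names are illustrative only and are not taken from the original study.

```python
from itertools import combinations

def centrality(p):
    """Sum of squared pairwise differences (assumed form of C(p))."""
    return sum((a - b) ** 2 for a, b in combinations(p, 2))

def prior_mean(v):
    """Mean of a Dirichlet distribution with parameter vector v."""
    total = sum(v)
    return [vi / total for vi in v]

def perturbed(v, shift=(0.09, 0.05, -0.14)):
    """Perturbed prior 10 x [v/10 + shift] quoted in the text."""
    return [10 * (vi / 10 + s) for vi, s in zip(v, shift)]

# The four prior parameters v_1, ..., v_4 of Design 2.
priors = [(0.1, 0.1, 9.8), (1, 1, 8), (2, 3, 5), (10 / 3, 10 / 3, 10 / 3)]

original = [round(centrality(prior_mean(v)), 2) for v in priors]
shifted = [round(centrality(prior_mean(perturbed(v))), 2) for v in priors]

print(original)  # [1.88, 0.98, 0.14, 0.0]
print(shifted)   # [1.16, 0.48, 0.01, 0.09]
```

Under this assumed form, a corner such as (0,0,1) gives the maximum value 2.00 and the center (1/3,1/3,1/3) gives 0, in agreement with the ranges described in the text.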

Results from the three robustness studies are reported in the next
section, and conclusions are given in the last section. Appendix 7A
gives the complete-data (PID=0) risks and estimated risks (with
associated standard errors) for PID=15 and PID=40 for the three
estimators, three robustness studies, four values of v, and ten Dirichlet
probabilities p. Tukey data summaries, central values, and spreads were
calculated for the risk estimates over the ten Dirichlet generations. The
averages, with standard errors, are given in Table 7A.7. We note here
that the posterior mean had the smallest average, even at v_1, when the
original prior was used in the Bayesian estimators. This result is
important because it means that the sampling distribution, even though
based on only ten probabilities, agreed with the theoretical distribution
at least in terms of which estimator minimized average risk. [Recall
Section 1.2.]

Appendix 7A also gives the analyses-of-variance results for the three 
robustness studies and plots of two of their interactions. 

In the remainder of this section, we briefly discuss computational 
aspects peculiar to Design 2. Since we were investigating which estima- 
tor best minimized risk for quadratic loss, the criterion for choosing 
among estimators was the estimated mean squared error (risk). After we 
discussed estimated mean squared error in Design 1, we studied the 
estimators in detail, especially at those values of p for which two or 
more estimators had approximately equal risk. In Design 2, we studied 
only estimated mean squared error and results from the analysis of 
variance on its natural logarithms. 

In Design 1, we used the regression estimate for the mean squared 
error. Where we could, we also used the regression estimate in Design 2. 
However, we could not calculate the regression estimate for some cases
when the prior was v_1=(.1,.1,9.8). In these cases, the complete-data
maximum likelihood estimate was the same for all 200 trinomial
simulations. Hence, its sample variance was zero. Therefore, the
denominator for the regression mean-squared-error estimate in (5.19) was
zero and the estimate was undefined.

The problem cases were those in which C(p) was 1.9993, 1.9999, and 
2.0000. These three cases were those in which the generated Dirichlet 
probabilities were approximately (0,0,1). The probabilities were 
(.00000004,.00011,.99989), (.000009,.0000000000003,.999991), and
(.000003,.000000000001,.999997), respectively. Note that there were
no problems in calculating regression estimates of the mean squared 
error for the generated Dirichlet probability p=(.01,0,.99) , for which 
C(p)=1.9182. Further, for the case in which C(p)=1.9993, the regression 
estimate was undefined for only half the cases. Hence, it was only when 
the population probability was almost identically (0,0,1) that the 
regression estimate did not exist. 

In these cases of undefined regression mse estimate, we used the
control-variate estimate. However, the control-variate estimate was
negative several times for the posterior mode when the generated
Dirichlet probability was approximately (0,0,1). Although this happened
only in cases where the regression estimate was defined, it happened for
a Dirichlet probability which had an undefined regression estimate for
most of the SS, PID, and replication variations. Hence, the
control-variate estimate was used for most variations and, for
consistency, would have been a better choice than the regression estimate
for the remaining cases. [See the fourth generation for PMDRO in Table
7A.3 for inconsistent mean squared errors resulting from use of the two
different estimates.] The control-variate estimate was negative because
the posterior mode had a regular mse estimate a couple of orders of
magnitude smaller than either the true or regular-estimate complete-data
mean squared error and the regular estimate was larger than the true
value. [See equation (5.13).]

The inconsistent mse estimates for PMDRO at p=(0,0,1) affected
results in the ANOVA. Variations among PID and SS levels were as large
as 100. This large variation gave rise to unnaturally large effects of
SS and, particularly, PID relative to those for estimator. Further, the
ANOVA model had an additional factor v, ten levels (instead of one) for
p within v, and p as a random factor instead of a fixed factor.
Therefore, the ANOVA model was more complicated than that for Design 1.
Hence, its results were more subject to error.

Therefore, as a precaution against reaching wrong conclusions, we
studied certain interactions, especially the (Pw.NU) x SS x PID x EST
interaction, independent of their significant effects in an ANOVA. An
additional reason for investigating this particular interaction was that
we wanted to ensure that any lack of significant effect for PID was
accurate. Even more important, we wanted to know how any lack of
significant effect related to absence of any change in log_e(mse) for the
two levels of PID. That is, PID could show no significant effect in the
ANOVA model strictly because the other factors had huge effects relative
to PID. In this case, there could still be a large change in log_e(mse)
for the two levels of PID.

Recall that in the last chapter we found that the Taylor-series
approximation APM for the posterior mean was usually accurate to at 
least four significant figures. In these cases, the APM mean-squared- 
error estimate was a good approximation for the EPM mean-squared-error 
estimate. It was usually accurate to at least three significant figures. 
The rare cases in which the Taylor-series approximation was not as good, 
however, were cause for concern about how well the APM mse estimate
approximated the EPM mse estimate. These cases occurred several times in
Design 2 at v_1 when the generated Dirichlet probability was approximately
(0,0,1). However, even though a few of the 200 trinomial simulations 
yielded poor approximations for the posterior mean, the APM mse estimate 
was an unusually good approximation for the EPM mse estimate. It was 
nearly always accurate to at least five significant figures. The reason 
is that in those cases (the majority of the 200 trinomial simulations) 
in which the approximation was not poor, the approximated posterior mean 
agreed extremely well with the exact posterior mean. Therefore, the APM 
mse estimate, an average over the 200 trinomial simulations, was a very 
good approximation. 

7.2 Results:

In this section we briefly give results from Design 2. We begin by 
giving, in Table 7.2, the risks averaged over the ten Dirichlet
generations of p. For estimated risks, those for PID=15 and PID=40, we
also averaged over the two replications. Note that averaged risks for
PID=0 are not given for the posterior mode (PMD) at v_1. As in Design 1,
we could not analytically calculate the complete-data risk for values of
p containing one or more very small components when v=(0.1,0.1,9.8). In
these cases, a solution to the likelihood equations may not exist in P_2.
[Note that (x_i+v_i-1)/(n+Σv_j-3) is negative when x_i=0 if v_i=0.1.] If
not, the posterior mode occurs on the boundary. Hence, p̂_i may equal 0
or 1, but the i-th solution (2.43) to the likelihood equation cannot be
used to calculate the risk.
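
The bracketed note can be illustrated with a small sketch. It evaluates the interior candidate (x_i+v_i-1)/(n+Σv_j-3) for complete trinomial data and shows that one component falls below zero when a cell is empty and v_i=0.1. The counts are hypothetical, and the function name is ours, not the study's.

```python
def interior_mode_candidate(x, v):
    """Interior solution (x_i + v_i - 1) / (n + sum(v) - k) for complete data."""
    n, k = sum(x), len(x)
    denom = n + sum(v) - k
    return [(xi + vi - 1) / denom for xi, vi in zip(x, v)]

v1 = (0.1, 0.1, 9.8)   # prior parameter v_1 from the text
x = (0, 1, 24)         # hypothetical complete-data counts, n = 25

candidate = interior_mode_candidate(x, v1)
inside_simplex = all(c >= 0 for c in candidate)

print(candidate)       # first component is negative
print(inside_simplex)  # False: the mode must lie on the boundary
```

The components still sum to one, but the candidate lies outside the simplex, so the posterior mode has to be found on the boundary instead.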

We are interested in how much risk increases as the data becomes 
incomplete. Table 7.2 shows that in 34 out of 44 cases, the averaged 
risk increased between 5% and 12% as the percentage of incomplete data 
(PID) increased from 0 to 15. The highest increase, 20%, was at a sample 
size of 50 for for the posterior mean (APM) when the perturbed prior 
was used in the Bayesian estimators. As the percentage of incomplete 
data increased from 0 to 40, the averaged risk increased between 17% and 
50%. Individual values showed greater variation than the averages given 
in Table 7.2. Occasionally, the complete-data risk was even greater than 
the risk when approximately 15% of the data was incomplete. In these 
cases, however, the complete-data exact value was nearly always within a 
standard error of the PID=15 estimated value. These cases probably occurred 
when the observed percentage of incomplete data was on the low side of
15%. [Recall that, for a sample size of 25, when PID=15 the observed
percentage of incomplete data could be 0%, 4%, 8%, 12%, 16%, 20%, or
24%,...] Finally, note that, as sample size increased from 25 to 50, the
averaged risk decreased by roughly one-half.
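
The halving of risk with a doubled sample size is what the complete-data case predicts. As a minimal sketch, the quadratic-loss risk of the complete-data maximum likelihood estimate x/n is the sum of the component variances, Σ p_i(1-p_i)/n. This closed form covers only the complete-data MLE; risks for incomplete data and for the Bayesian estimators were estimated by simulation in the study and are not reproduced here.

```python
def mle_risk_complete(p, n):
    """Quadratic-loss risk of the complete-data MLE x/n: sum_i p_i (1 - p_i) / n."""
    return sum(pi * (1 - pi) for pi in p) / n

p = (0.2, 0.3, 0.5)   # prior mean for v_3 quoted in the text

risk_25 = mle_risk_complete(p, 25)
risk_50 = mle_risk_complete(p, 50)
ratio = risk_50 / risk_25

print(risk_25, risk_50, ratio)  # the ratio is one-half
```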

To compare the difference between estimators, we divided the averaged 
risk for the posterior mean by that for the posterior mode and that for 
the maximum likelihood estimate. Results given in Table 7.3 show that, 
with small exception, the averaged risk was smallest for the posterior 
mean for all variations in prior parameter, percentage of incomplete 
data, and sample sizes when the correct prior was used in the Bayesian 
estimators. The exception is that the posterior mode had, to two signif- 
icant figures, the same risk for and almost equal risk for [Re- 
call Table 7.2.] As p moved from the center of the P 2 simplex toward 
a corner (from v_4 to v_1), the advantage in using the posterior mean over
the posterior mode increased. The advantage in using the posterior mean
over the maximum likelihood estimate was greatest at the center or a
corner of P_2. At v_1, the risk of the posterior mean was almost one half
that for the posterior mode or maximum likelihood estimate for a sample 
size of 25. For other values of the prior parameter v, percentage of 
incomplete data PID, and sample size SS, the averaged risk of the
posterior mean lay between 70% and 100% of that for the posterior mode
and maximum likelihood estimate.

When a uniform prior was used in the Bayesian estimators, the
posterior mode equaled the maximum likelihood estimate. For this case,
results of Table 7.3 show that, in terms of averaged risk, the posterior
mean was the best estimator at v_3 and v_4, in the middle and near the
center of P_2. The maximum likelihood estimate (≡ posterior mode) was
the best estimator near a boundary of P_2 at v_1 and v_2. The
maximum difference near the center of P_2 was only 70% (relative to the
smallest value). At a boundary, however, the averaged risk was between
two and three times smaller for the maximum likelihood estimate
(≡ posterior mode) than for the posterior mean.

When the perturbed prior 10x[v/10+(.09,.05,-.14)] was used in the
Bayesian estimators, the posterior mode had the smallest averaged risk,
except at the center of the P_2 simplex, where the posterior mean was
slightly better. The largest difference between estimators was at 
where the risk of the posterior mean was 40% larger than that for the 
posterior mode. 

Note that for all three priors (correct, uniform, and perturbed), 
there was very little difference between estimators as the percentage 
of incomplete data changed. As sample size increased, the ratios moved 
toward 1; i.e., the difference between estimators decreased. 

As discussed in Chapter 1, however, we were most interested in 
difference in risk as a function of the individual values of p. To 
investigate this relationship, we first performed an analysis of variance 
on the natural logarithms of the estimated mean squared errors (risks). 
The F values from these analyses are given in Tables 7.8A, 7.9A, and 
7.10A for use of the correct, uniform, and perturbed prior (robustness 
studies R0, R1, and R2), respectively, in the Bayesian estimators. By far
the most significant effect in all three ANOVAs was that of p within v
(Pw.NU), usually followed by (Pw.NU) x EST. Further, in all three
robustness studies, (Pw.NU) x EST x SS was significant at the 1% level
and there was a three-way (x PID) or four-way (x PID x SS) significant
interaction of (Pw.NU) x EST with PID in each analysis.

Following the ANOVA tables in Appendix 7A are plots of significant 
or otherwise important (recall Introduction) interactions. These plots 
indicated that there was little change in the difference between estima- 
tors as the percentage of incomplete data (PID) increased from 15 to 40. 
Further, although the difference between estimators decreased as sample 
size increased, the shape of the estimator curves for the two sample 
sizes was nearly the same. Therefore, we summarize results from these 
analyses by giving in Tables 7.4, 7.5, and 7.6 plots of the (Pw.NU) x SS
x PID x EST interactions for only SS=25 and PID=15. Note that the
horizontal axis is the centrality measure of the generated p. The
vertical axis is log_e(risk) [≡ log_e(mse)]. Recall from Chapter 6 that,
because we used two replications and the plotted values are sums over the
replications, the square root of the exponential of a difference between
logarithms approximately equals the ratio of the risks of the two
estimators. Thus, any difference of 6 between two estimators in the
log_e scale in Plots 7.4, 7.5, and 7.6 means that one of the two
estimators had a risk about twenty times larger than that of the other
estimator.
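
The conversion from a plotted difference to a risk ratio can be sketched as follows. The sketch assumes, as the text states, that plotted values are sums of log_e(mse) over two replications, so a difference d corresponds to a ratio of about exp(d/2). The function name and the `replications` parameter are illustrative.

```python
import math

def risk_ratio(summed_log_difference, replications=2):
    """Approximate ratio of risks implied by a difference of summed log_e(mse)."""
    return math.exp(summed_log_difference / replications)

# A difference of 6 gives a ratio of about twenty; the larger differences
# 10.6, 20.8, and 18.9 are those reported in the text for the corner p=(0,0,1).
for d in (6.0, 10.6, 20.8, 18.9):
    print(d, round(risk_ratio(d)))
```

These ratios are consistent with the factors of roughly 200, 33,000, and 13,000 that the text reports for the corner p=(0,0,1).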

There are three important factors to consider in these three plots:
the distribution from which the generated p comes, the value of the
generated p, and the value of the prior parameters used in the Bayesian
estimators. In all three plots, the distribution from which p comes is
the Dirichlet distribution given the prior v. The centrality measure of
the mean of this distribution is marked on the three plots by the arrow
(↑) for the four values of v. We call this prior mean the v-prior mean
or the correct-prior mean. We call the mean of the prior distribution
given the prior parameters used in the Bayesian estimators the
estimator-prior mean. The centrality measure of the estimator-prior mean
is marked on the plots by an "x" when, in Plots 7.5 and 7.6, it differs
from the v-prior mean. [Recall Figure 6.2 for estimator-prior means.]

Notice that the closer the v-prior mean is to a corner or to the 
center of the simplex, the tighter the distribution of the generated 
values of p. Away from these points, the distribution is fairly wide; 
for example, the distribution of p given v_2 covers almost the entire
C(p) axis. 

Denote the estimator-prior mean by p̄. Plots 7.4 - 7.6 show that,
except for v_1, there is a neighborhood of C(p̄) in which the posterior
mean is the best estimator for minimizing risk, often followed by an
outer one-sided neighborhood toward 2.00 in which the posterior mode is
best. Finally, in the tails of the distribution of p given the prior
parameters used in the Bayesian estimators, the maximum likelihood
estimate is best.

Thus, the posterior mean was the best estimator most of the time.
In these cases, the posterior mode was usually next best. Other than
cross-over probabilities, the smallest difference between the posterior
mode and mean was at the center of the P_2 simplex. There, the risk of
the posterior mean was reduced only 14% to 23% from that of the posterior
mode, whereas it was reduced 22% to 42% from that of the maximum
likelihood estimate.

Except near the tails of the estimator-prior distribution or near
the p=(0,0,1) corner of P_2, the difference in log_e(mse) for the
estimators was usually between -1.4 and 0.8; difference in risk ranged
from 0 to a 50% decrease. Use of the correct estimator most often reduced
the risk by about one-third. At the tails, the maximum difference between
log_e(mse) for the three estimators ranged from 0.8 (about a 50% increase
in risk) in Table 7.4 at C(p)=0.06 for v_2 to 1.9 (risk almost tripled) at
C(p)=1.05 for v_3 in Table 7.6 to 5.6 (risk increased more than 16 times)
at C(p)=1.94 for v_2 in Table 7.6. However, the largest difference between
estimators occurred for v_1 at the corner p=(0,0,1) where C(p)=2.00. At
this probability p, values of log_e(mse) for the maximum likelihood
estimate and posterior mode were equal. The large difference in
log_e(mse) between this value and that for the posterior mean was 10.6,
20.8, and 18.9 for use in the Bayesian estimators of the correct,
uniform, and perturbed prior, respectively. These differences correspond
to an increase in risk of 200 times, 33,000 times, and 13,000 times the
risk for the maximum likelihood estimate or posterior mode. Note,
however, that this enormous difference occurred only exactly at the
(0,0,1) corner. For example, the probability
p=(.4x10^-7,.1x10^-3,.99989) also had, rounded off, C(p)=2.00 but the
multiplicative increase in risk in using the posterior mean instead of
the posterior mode was by a factor of 77.5, 992, and 854, respectively,
for the three robustness studies. Thus, the increase was huge but not of
the order found when the first two components had more zeros. As p moved
further from the (0,0,1) corner, the difference in risks continued to
drop sharply.

In Figure 7.1 we give, for each value of the centrality measure C(p̄)
of the estimator-prior mean p̄, the ranges of C(p) in which the posterior
mean, posterior mode, and maximum likelihood estimate were best. Note
that the limits sometimes differ slightly from the plots. In these cases,
the difference between the limit and the correct value was small. We used
the slightly inexact value to give limits in .05 increments and to give
agreement between slightly different values for the limits for those
estimator-prior means having centrality measures of 0 and .01. [Recall
that the estimator-prior mean for all four plots in Table 7.5 (use of the
uniform prior) has centrality measure C(p̄) of 0 as well as one
estimator-prior mean in Table 7.4.] Note in Figure 7.1 that, for the
uniform prior, the region in which the posterior mean is best is
0 ≤ C(p) < .70. Also note that the posterior-mode range for C(p̄)=0.09
was unusually short; that for the maximum likelihood estimate began
sooner than results from neighboring values of C(p̄) would indicate.

Results from Design 2 indicate that if one is even reasonably con- 
fident in the prior, then the best estimator to use is the posterior 
mean unless the prior mean is at the corner of the simplex, in which 
case the posterior mode is better. Hence, we recommend, for an initial
try, use of the posterior mean if the prior mean p̄ has C(p̄)<1.5; the
posterior mode, otherwise.

In practice, one can replace p in Figure 7.1 by the estimate p̂
and interpolate in the intervals in Figure 7.1 to refine the estimation
process. That is, if one uses the prior β with prior mean p̄_j=β_j/Σβ_j in
an estimator p̂, then one can compare C(p̂) with the regions given for
C(p) to determine if the best estimator was used. If not, then p̂ can
be discarded and the recommended estimator used. For example, if one
has a prior β=(12,1,2), then the estimator-prior mean p̄ is (.80,.07,.13),
which has centrality measure
C(p̄)=(.80-.07)^2+(.80-.13)^2+(.07-.13)^2=.99.
Results of Figure 7.1 (using the row for C(p̄)=.98) indicate that if use
of the posterior mean gives an estimator p̂ with C(p̂) between .10 and
1.45, then the posterior mean is the best estimator to use. If, however,
C(p̂) is greater than 1.45, then we should discard the posterior mean and
use the posterior mode. Similarly, if C(p̂) is less than .10, we should
replace the posterior mean by the maximum likelihood estimate.


FIGURE 7.1

INTERPOLATION TABLE

if C(p̄) was       and C(p) was              then best estimator was

0 a and .01 b      0 ≤ C(p) < .20            posterior mean
                   .20 ≤ C(p) < .70          posterior mode c
                   otherwise                 maximum likelihood

.09                0 ≤ C(p) < .20            posterior mean
                   .20 ≤ C(p) < .35          posterior mode d
                   otherwise                 maximum likelihood

.14 a              0 ≤ C(p) < .45            posterior mean
                   .45 ≤ C(p) < 1.05         posterior mode
                   otherwise                 maximum likelihood

.48                0 ≤ C(p) < .85            posterior mean
                   .85 ≤ C(p) < 1.60         posterior mode
                   otherwise                 maximum likelihood

.98 a              0 ≤ C(p) < .10            maximum likelihood
                   .10 ≤ C(p) < 1.45         posterior mean
                   1.45 ≤ C(p) < 1.94        posterior mode
                   1.94 ≤ C(p) ≤ 2.00        posterior mode e

1.16               0 ≤ C(p) < .30            maximum likelihood e
                   .30 ≤ C(p) < 1.25         posterior mean e
                   1.25 ≤ C(p) < 1.55        posterior mean
                   1.55 ≤ C(p) ≤ 2.00        posterior mode

1.88               0 ≤ C(p) < .90            maximum likelihood e
                   .90 ≤ C(p) < 1.25         posterior mean e
                   1.25 ≤ C(p) < 1.55        posterior mean
                   1.55 ≤ C(p) ≤ 2.00        posterior mode

a See plots in Table 7.4
b See plots in Table 7.6
c When uniform prior was used in Bayesian estimators, best estimator
  was the posterior mean instead of the posterior mode; see Table 7.5
d Risk of posterior mode differs little from that of posterior mean
  or max. likelihood est.
e Extrapolated

Note that results of Designs 1 and 2 indicate that the maximum 
likelihood estimate, posterior mode, and posterior mean will usually be 
close enough that their centrality measures will differ little. That 
is, C(p̂) should not differ greatly for the three estimators. Finally,
we emphasize that the regions in Figure 7.1 are not exact. Further,
replacing p by the estimate p̂ in Figure 7.1 makes the regions even
less exact. Hence, regions in Figure 7.1 should be considered only as 
rough guidelines. Even so, their use can still be expected to reduce 
risk by 1/4 to 1/2 in most cases and by substantially more in many cases. 
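
The refinement step described in this section can be sketched as a table lookup. The interval limits below are transcribed from Figure 7.1 as best they can be read and should be treated as approximate. The sketch picks the nearest tabulated row instead of interpolating between rows, which is a simplification of ours. It also ignores the figure's footnote on the uniform prior. All names are illustrative.

```python
from itertools import combinations

# Centrality of the estimator-prior mean -> [(upper limit of C, best estimator), ...]
# The last entry of each row covers the remainder of the range up to 2.00.
FIGURE_7_1 = {
    0.00: [(0.20, "posterior mean"), (0.70, "posterior mode"), (2.00, "maximum likelihood")],
    0.09: [(0.20, "posterior mean"), (0.35, "posterior mode"), (2.00, "maximum likelihood")],
    0.14: [(0.45, "posterior mean"), (1.05, "posterior mode"), (2.00, "maximum likelihood")],
    0.48: [(0.85, "posterior mean"), (1.60, "posterior mode"), (2.00, "maximum likelihood")],
    0.98: [(0.10, "maximum likelihood"), (1.45, "posterior mean"), (2.00, "posterior mode")],
    1.16: [(0.30, "maximum likelihood"), (1.55, "posterior mean"), (2.00, "posterior mode")],
    1.88: [(0.90, "maximum likelihood"), (1.55, "posterior mean"), (2.00, "posterior mode")],
}

def centrality(p):
    """Sum of squared pairwise differences of the components of p."""
    return sum((a - b) ** 2 for a, b in combinations(p, 2))

def recommended_estimator(prior, estimate):
    """Look up the suggested estimator for a prior and a first-pass estimate."""
    total = sum(prior)
    c_prior = centrality([b / total for b in prior])
    # Nearest tabulated row (a simplification; the text suggests interpolating).
    row = min(FIGURE_7_1, key=lambda c: abs(c - c_prior))
    c_est = centrality(estimate)
    intervals = FIGURE_7_1[row]
    for upper, name in intervals[:-1]:
        if c_est < upper:
            return name
    return intervals[-1][1]

prior = (12, 1, 2)  # the example prior from the text, C of its mean is about .99
print(recommended_estimator(prior, (0.60, 0.20, 0.20)))  # posterior mean
print(recommended_estimator(prior, (0.95, 0.03, 0.02)))  # posterior mode
print(recommended_estimator(prior, (0.34, 0.33, 0.33)))  # maximum likelihood
```

The three calls use hypothetical first-pass estimates whose centrality values fall in the middle, upper, and lower regions of the row nearest C(p̄)=.98.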

7.3 Conclusions:

Based on results from Design 2 summarized in the last section, we 
revised the operating guideline from Design 1 as shown in the following 
Figure 7.2: 


FIGURE 7.2

OPERATING GUIDELINES

Given data z and prior parameter v

Calculate prior mean p̄ with component p̄_i = v_i/Σv_j

Calculate C(p̄) = Σ_{i=1}^{k} Σ_{j>i}^{k+1} (p̄_i - p̄_j)^2

if                      for estimator p̂ to minimize risk, use

0 ≤ C(p̄) < 1.50         posterior mean (Taylor-series approx.)

1.50 ≤ C(p̄) ≤ 2.00      posterior mode

Calculate C(p̂)

Compare C(p̂) with C(p) intervals in Figure 7.1 for prior mean p̄

If C(p̂) is not in recommended interval, recalculate
estimator as recommended in Figure 7.1

The gain in using the estimator recommended by these procedures is
usually a 1/4 to 1/2 reduction in risk. In many cases, however, the
reduction can be very large. The largest reduction in risk in this study
occurred when p=(0,0,1). For this corner probability, the risk of the
posterior mean was as much as 33,000 times larger than the risk for the
posterior mode or maximum likelihood estimate.


TABLE 7.1 
Note that the three C(p) values of 2.00 correspond to 1.9993 for
p=(.4x10^-7,.1x10^-3,.99989), 1.9999 for p=(.9x10^-5,.3x10^-12,.999991),
and 2.0000 for p=(.3x10^-5,.1x10^-11,1.), respectively.

Values are the average over two replications in addition to an average over ten Dirichlet generations. Risk is estimated
by the regression or control-variate estimate (see Tables 7A.1 - 7A.6 and Sections 5.9 and 7.1).

f Values were not calculated (see main text)

1 Ratios are of averaged risks given in Table 7.2

2 Prior used in Bayesian estimators APM (Taylor-series approximate posterior
mean) and PMD (posterior mode)

3 For uniform prior, PMD=MLE

TABLE 7.4

PLOT OF (Pw.NU)*EST*SS*PID INTERACTION^1. SS=25, PID=15. ORIGINAL PRIOR USED IN BAYESIAN ESTIMATORS.

[Plot not reproduced. Series: APM, PMD, MLE. Horizontal axis: C(p); vertical axis: log_e mse.]

1 Values are sums over replication. Arrow (↑) denotes centrality measure of expected value of p given v [see Table 5.1].

2 Horizontal axis for the three sets of values plotted at 2.0 for v_1 is rescaled to have values 1.9993, 1.9999, 2.0000. Note that the vertical
axis is also rescaled.

TABLE 7.5

PLOT OF (Pw.NU)*EST*SS*PID INTERACTION^1. SS=25, PID=15. UNIFORM PRIOR USED IN BAYESIAN ESTIMATORS.

[Plot not reproduced. Series: APM, PMD≡MLE. Horizontal axis: C(p), from 0 to 2.0.]

1 Values are sums over replication. Arrow (↑) denotes the centrality measure of the expected value of p given v [see Table 5.1]. Cross (x)
denotes the centrality measure of the expected value of p given the uniform prior (1,1,1) [see Table 5.2 and Figure 6.2].

2 Horizontal axis for the three sets of values plotted at 2.0 for v_1 is rescaled to have values 1.9993, 1.9999, 2.0000. Note that the vertical
axis is also rescaled.









APPENDIX 7A 


DATA FOR DESIGN 2 



RISK FOR MLE, THE MAXIMUM LIKELIHOOD ESTIMATE 
Risk is estimated by the regression estimate if the regression coefficient b exists (i.e., b ≠ number/zero). Otherwise, risk is estimated
by the control-variate mse estimate. [See Sections 5.9 and 7.1.] Undefined regression estimate occurs in some cases when generated
Dirichlet probability p approximately equals (0,0,1).



TABLE 7A.2 

RISK FOR APMRO, THE TAYLOR-SERIES APPROXIMATE POSTERIOR MEAN FOR ROBUSTNESS STUDY 0 (CORRECT PRIOR USED IN BAYESIAN ESTIMATORS) 
Note that values are given in scientific notation; e.g., .0079 is written as .79-2. Values in parentheses are standard errors.



RISK FOR PMDRO, THE POSTERIOR MODE FOR ROBUSTNESS STUDY RO (CORRECT PRIOR USED IN BAYESIAN ESTIMATORS) 




Risk is estimated by the regression estimate if the regression coefficient b exists (i.e., b ≠ number/zero). Otherwise, risk is estimated
by the control-variate mse estimate. [See Sections 5.9 and 7.1.] Undefined regression estimate occurs in some cases when generated
Dirichlet probability p approximately equals (0,0,1).

Risk is estimated by the regression estimate [see Section 5.9]

Note that values are given in scientific notation; e.g., .0065 is written as .65-2. Values in parentheses are standard errors.
Risk is estimated by the regression estimate [see Section 5.9]

Risk is estimated by the regression estimate if the regression coefficient b exists (i.e., b ≠ number/zero). Otherwise, risk is estimated
by the control-variate mse estimate. [See Sections 5.9 and 7.1.] Undefined regression estimate occurs in some cases when generated
Dirichlet probability p approximately equals (0,0,1).



Mean Over 10 Dirichlet Simulations for Quadratic-Loss Estimated MSE. 
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TABLE 7A.8A 

ANALYSIS OF VARIANCE FOR NATURAL LOGARITHMS OF ESTIMATED QUADRATIC-LOSS 
MEAN SQUARED ERRORS FOR ROBUSTNESS SET 0 (ORIGINAL PRIOR IN ESTIMATORS) 


SOURCE              D.F.    SUM OF SQ.    MEAN SQ.            F

NU                     3    .251377  4    .837924  3     10.499 ***
SS                     1    .896719  2    .896719  2    467.367 ***
PID                    1    .644600  1    .644600  1    607.009 ***
EST                    2    .156130  2    .780652  1      1.879
NUxSS                  3    .366810  0    .122270  0       .637
NUxPID                 3    .255790  0    .852634 -1      8.029 ***
NUxEST                 6    .527583  2    .879306  1      2.116 *
SSxPID                 1    .124512 -1    .124512 -1       .118
SSxEST                 2    .109549  1    .547744  0      4.703 **
PIDxEST                2    .156421  0    .782106 -1     14.588 ***
NUxSSxPID              3    .306224  0    .102075  0       .967
NUxSSxEST              6    .348970  0    .581616 -1       .499
NUxPIDxEST             6    .921781 -1    .153630 -1      2.866 **
SSxPIDxEST             2    .181913  0    .909564 -1       .901
NUxSSxPIDxEST          6    .654974  0    .109162  0      1.081
Pw.NU                 36    .287310  4    .798085  2   1672.306 ***
(Pw.NU)xSS            36    .690718  1    .191866  0      4.020 ***
(Pw.NU)xPID           36    .382294  0    .106193 -1       .223
(Pw.NU)xEST           72    .299159  3    .415499  1     87.064 ***
(Pw.NU)xSSxPID        36    .380123  1    .105590  0      2.213 ***
(Pw.NU)xSSxEST        72    .838485  1    .116456  0      2.440 ***
(Pw.NU)xPIDxEST       72    .386013  0    .536129 -2       .112
(Pw.NU)xSSxPIDxEST    72    .727218  1    .101002  0      2.116 ***
ERROR                480    .229073  2    .477236 -1
TOTAL                959    .590404  4

  * Significant at 10% level.
 ** Significant at 5% level.
*** Significant at 1% level.

Note that the usual exponential notation is used for the third and fourth columns;
for example, 5904.04 is written as .590404 4.
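
The exponential notation described in the note can be decoded mechanically. The following is a minimal sketch; the helper name `decode` is hypothetical and is introduced only for illustration. It also checks one row of the table against the usual identity, mean square = sum of squares / degrees of freedom.

```python
def decode(entry: str) -> float:
    """Decode a table entry such as '.590404 4' (mantissa, power of ten)."""
    mantissa, exponent = entry.split()
    return float(mantissa) * 10 ** int(exponent)


# The example quoted in the note: .590404 4 stands for 5904.04.
total_ss = decode(".590404 4")

# Row NU of the table: 3 degrees of freedom.
nu_ss = decode(".251377 4")   # sum of squares
nu_ms = decode(".837924 3")   # mean square
print(total_ss, nu_ss / 3, nu_ms)
```

The same check can be applied to any other row by substituting its degrees of freedom and its two entries.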


[Interaction plot. Legend: APM, PMD, MLE. Plot labels: PID=15, NU4.]
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TABLE 7A..9A 

ANALYSIS OF VARIANCE FOR NATURAL LOGARITHMS OF ESTIMATED QUADRATIC-LOSS 
MEAN SQUARED ERRORS FOR ROBUSTNESS SET 1 (UNIFORM PRIOR IN ESTIMATORS) 


SOURCE              D.F.    SUM OF SQ.    MEAN SQ.            F

NU                     3    .723959  3    .241320  3     11.557 ***
SS                     1    .865921  2    .865921  2   3702.584 ***
PID                    1    .678313  1    .678313  1   1111.425 ***
EST                    1    .103706  3    .103706  3      6.492 **
NUxSS                  3    .129282  1    .430939  0     18.426 ***
NUxPID                 3    .205667 -1    .685557 -2      1.123
NUxEST                 3    .317635  3    .105878  3      6.628 ***
SSxPID                 1    .105919 -2    .105919 -2       .344
SSxEST                 1    .185452  0    .185452  0      9.239 ***
PIDxEST                1    .818075 -2    .818075 -2      1.769
NUxSSxPID              3    .222064 -2    .740214 -3       .241
NUxSSxEST              3    .160116  1    .533721  0     26.588 ***
NUxPIDxEST             3    .172753  0    .575842 -1     12.449 ***
SSxPIDxEST             1    .169200 -3    .169200 -3       .280
NUxSSxPIDxEST          3    .480949 -2    .160316 -2      2.654 *
Pw.NU                 36    .751737  3    .208816  2   6623.127 ***
(Pw.NU)xSS            36    .841929  0    .233869 -1      7.418 ***
(Pw.NU)xPID           36    .219711  0    .610310 -2      1.936 **
(Pw.NU)xEST           36    .575080  3    .159744  2   5066.700 ***
(Pw.NU)xSSxPID        36    .110764  0    .307677 -2       .976
(Pw.NU)xSSxEST        36    .722642  0    .200734 -1      6.367 ***
(Pw.NU)xPIDxEST       36    .166521  0    .462560 -2      1.467 *
(Pw.NU)xSSxPIDxEST    36    .217426 -1    .603960 -3       .192
ERROR                320    .100891  1    .315283 -2
TOTAL                639    .257187  4




* Significant at 10% level. 

** Significant at 5% level. 

*** Significant at 1% level. 
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TABLE 7A.10A 

ANALYSIS OF VARIANCE FOR NATURAL LOGARITHMS OF ESTIMATED QUADRATIC-LOSS 
MEAN SQUARED ERRORS FOR ROBUSTNESS SET 2 (PERTURBED PRIOR IN ESTIMATORS) 


SOURCE              D.F.    SUM OF SQ.    MEAN SQ.            F

NU                     3    .195478  4    .651592  3     13.159 ***
SS                     1    .980437  2    .980437  2   1250.280 ***
PID                    1    .797488  1    .797488  1    774.940 ***
EST                    2    .964996  2    .482498  2      4.445 **
NUxSS                  3    .102821  0    .342735 -1       .437
NUxPID                 3    .184583  0    .615277 -1      5.979 ***
NUxEST                 6    .251424  3    .419040  2      3.860 ***
SSxPID                 1    .143585 -1    .143585 -1      1.827
SSxEST                 2    .540025  0    .270013  0      8.854 ***
PIDxEST                2    .453269 -1    .226634 -1      3.441 **
NUxSSxPID              3    .153708 -1    .512360 -2       .652
NUxSSxEST              6    .353239  0    .588731 -1      1.931 *
NUxPIDxEST             6    .725311 -1    .120885 -1      1.835
SSxPIDxEST             2    .127573 -1    .637866 -2      1.511
NUxSSxPIDxEST          6    .333265 -1    .555442 -2      1.315
Pw.NU                 36    .178257  4    .495159  2   7138.288 ***
(Pw.NU)xSS            36    .282303  1    .784174 -1     11.305 ***
(Pw.NU)xPID           36    .370475  0    .102910 -1      1.484 **
(Pw.NU)xEST           72    .781618  3    .108558  2   1564.988 ***
(Pw.NU)xSSxPID        36    .282862  0    .785727 -2      1.133
(Pw.NU)xSSxEST        72    .219563  1    .304948 -1      4.396 ***
(Pw.NU)xPIDxEST       72    .474228  0    .658649 -2       .950
(Pw.NU)xSSxPIDxEST    72    .304022  0    .422252 -2       .609
ERROR                480    .332960  1    .693667 -2
TOTAL                959    .498407  4

  * Significant at 10% level.
 ** Significant at 5% level.
*** Significant at 1% level.
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CHAPTER 8 


SUMMARY AND CONCLUSIONS 

In this thesis we considered simultaneous estimation of the vector of
multinomial cell probabilities p from incomplete data, incomplete in that
it contains partially classified observations. Each such partially
classified observation is observed to fall in one of two or more selected
categories but is not classified further. The estimation criterion was
minimization of the risk $E[L(\hat{p},p)]$ for the quadratic loss
$L(\hat{p},p) = (\hat{p}-p)'(\hat{p}-p)$, where $\hat{p}$ denotes an
estimator of p.

The estimators considered were the classical maximum likelihood
estimate and the Bayesian posterior mean and posterior mode. We chose
the maximum likelihood estimate because it is frequently used in practice.
In particular, the maximum likelihood estimate is often used when one has
no prior information. Further, Johnson (1971) proved that the
complete-data maximum likelihood estimate is admissible; that is, no other
estimator can have smaller risk everywhere. The complete-data maximum
likelihood estimate is admissible because it has very small risk at the
corners of the simplex. We chose the posterior mean because it minimizes
expected risk (the risk averaged over the prior); hence, it must be best
for at least some values of p. We chose the posterior mode because it is
an in-between estimator. Like the maximum likelihood estimate, it is a
mode and can have zero components for a nonzero prior. Like the posterior
mean, it can incorporate prior information.

A final reason for choosing these three estimators was that the maximum
likelihood estimate, the posterior mode, and a Taylor-series approximation
of the posterior mean (discussed below) can all be evaluated by the EM
algorithm of Dempster, Laird, and Rubin (1977). This was important because
these three estimators each constitute a nonlinear system of k equations
in k unknowns, for which the number of solutions may range from zero to
infinity. Further, as illustrated in Section 4D.5, any roots that do exist
need not be in $P_k$. Finally, when roots do exist in $P_k$, there can be
difficulty in finding the one for which the likelihood is a maximum.
However, Dempster, Laird, and Rubin (1977) proved that if the eigenvalues
of the covariance matrix of the complete-data sufficient statistics are
bounded above zero, then the EM iterative algorithm converges in $P_k$ to
a local maximum. A global maximum is then found by choosing that root in
$P_k$ that maximizes the likelihood function

\[
\prod_{i=1}^{k+1} p_i^{z_i + a_i} \prod_{D} p_D^{z_D},
\]

where p denotes one of the three estimators and where $a_i = 0$ for the
maximum likelihood estimate and $a_i = \nu_i - 1$ for the posterior mode
and the Taylor-series approximate posterior mean.
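
The EM iteration referred to above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in the thesis: the function name `em_estimate` and the dictionary layout for the partially classified counts are hypothetical, $z_D$ is taken to be the number of observations known only to fall in the set D, $p_D$ is taken to be the sum of $p_j$ over j in D, and negative exponents are simply truncated at zero in the M-step.

```python
import numpy as np


def em_estimate(z, partial, a=None, p0=None, tol=1e-12, max_iter=10_000):
    """EM iteration for multinomial cell probabilities with partially
    classified counts (illustrative sketch).

    z       : fully classified counts z_i, length k+1
    partial : dict mapping a tuple of category indices D to the count z_D
    a       : exponent offsets a_i (0 for the MLE, nu_i - 1 for the mode)
    p0      : initial iterative estimate (uniform if omitted)
    """
    z = np.asarray(z, dtype=float)
    a = np.zeros_like(z) if a is None else np.asarray(a, dtype=float)
    p = np.full(z.size, 1.0 / z.size) if p0 is None else np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        # E-step: allocate each partial count z_D to the cells of D in
        # proportion to the current probabilities p_i / p_D.
        x = z.copy()
        for D, z_D in partial.items():
            idx = list(D)
            x[idx] += z_D * p[idx] / p[idx].sum()
        # M-step: maximize prod p_i^(x_i + a_i). Truncating negative
        # exponents at zero is an assumption made for this sketch.
        w = np.clip(x + a, 0.0, None)
        p_new = w / w.sum()
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p


# Hypothetical trinomial data: 20 fully classified observations and
# 6 observations known only to fall in category 0 or 1.
z = [10, 5, 5]
partial = {(0, 1): 6}
p_mle = em_estimate(z, partial)                               # a_i = 0
p_mode = em_estimate(z, partial, a=np.array([2.0, 2.0, 2.0]) - 1)  # nu = (2,2,2)
print(p_mle, p_mode)
```

Because the iteration only guarantees a local maximum, running it from several initial estimates `p0` and keeping the root with the largest likelihood mirrors the selection rule described in the text.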

We showed these three estimators to be approximately equal in large
samples. To compare these estimators in small- and medium-size samples,
we used two Monte-Carlo simulation studies restricted, because of cost
constraints, to samples from the trinomial distribution. In the studies,
samples were of size 25 and 50, percentages of incomplete data varied
around 15 and 40, and probabilities ranged from the center of the
probability simplex to one of its corners. In the first simulation study,
we chose the mean of the prior distribution, given one of four prior
parameters, as the probability to be estimated. In the second study we
randomly generated ten probabilities from the Dirichlet distribution given
each of the four prior parameters. For each probability, in both studies,
we then generated 200 sets of complete and incomplete trinomial data from
which an estimate of risk was calculated. Because the prior is not known
in practice, we also explored how sensitive the results were to using the
correct prior in calculating the Bayesian estimators. Besides the correct
prior, we also used the uniform prior and a perturbed prior in the
calculations.
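
A simulation of this kind can be sketched as below. The sketch makes several assumptions that are not taken from the thesis: each observation is left partially classified independently of its category, a partially classified observation is reported as one of the two pairs of categories containing its true category with equal probability, and risk is estimated by a plain average of the quadratic loss rather than by the regression or control-variate estimates of Section 5.9. The names `simulate_incomplete`, `mc_risk`, and `complete_case` are hypothetical.

```python
import numpy as np

# For each true category, the two-category sets that can be reported.
PAIRS = {0: [(0, 1), (0, 2)], 1: [(0, 1), (1, 2)], 2: [(0, 2), (1, 2)]}


def simulate_incomplete(p, n, frac_incomplete, rng):
    """One trinomial sample of size n with some observations coarsened."""
    cats = rng.choice(3, size=n, p=p)
    hidden = rng.random(n) < frac_incomplete   # independent of category
    z = np.bincount(cats[~hidden], minlength=3)
    partial = {}
    for c in cats[hidden]:
        D = PAIRS[int(c)][rng.integers(2)]
        partial[D] = partial.get(D, 0) + 1
    return z, partial


def mc_risk(estimator, p, n, frac_incomplete, reps=200, seed=0):
    """Average quadratic loss of an estimator over simulated data sets."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    losses = []
    for _ in range(reps):
        z, partial = simulate_incomplete(p, n, frac_incomplete, rng)
        losses.append(np.sum((estimator(z, partial) - p) ** 2))
    return float(np.mean(losses))


def complete_case(z, partial):
    """Naive estimator that discards partially classified observations."""
    total = z.sum()
    return z / total if total > 0 else np.full(3, 1.0 / 3.0)


nu = np.array([10 / 3, 10 / 3, 10 / 3])       # one prior parameter vector
p = np.random.default_rng(1).dirichlet(nu)    # probability to be estimated
risk = mc_risk(complete_case, p, n=25, frac_incomplete=0.15)
print(p, risk)
```

Any estimator with the signature `estimator(z, partial)` can be passed to `mc_risk`, so the three estimators can be compared on identical simulated data by reusing the same seed.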

Results indicated that an important factor in determining which
estimator was best was the position of p in the simplex; in particular,
whether p was at a corner or in the center of the simplex. Another
important factor was the relationship between the probability p being
estimated and the prior parameters $\nu$ used in the Bayesian estimators.
We studied this relationship in terms of the difference between p and the
mean of the prior distribution given $\nu$. The most satisfactory measure
of this difference was the difference between the linear centrality
measures of p and of the prior mean, where

\[
C(p) = \sum_{i=1}^{2} \sum_{j>i}^{3} (p_i - p_j)^2 .
\]

Results indicated that, except at the corner p=(0,0,1), when the
centrality measure of the prior mean was within a fairly wide range of
C(p), the posterior mean was best. If the difference between the two
centrality measures was very large, then the maximum likelihood estimate
was best. If the difference was between moderate and very large, the
posterior mode was often best when the probability being estimated was
toward a corner of the simplex. At the p=(0,0,1) corner, the posterior
mode or maximum likelihood estimate was always far better than the
posterior mean.

Based on these results, in Section 7.3 we recommended rough operating
procedures to guide a practitioner in choosing which estimator to use
for the data and estimated prior parameters at hand.
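
The centrality measure used above is easy to compute. The sketch below assumes that C(p) is the sum of squared differences over all pairs of components, which is how the definition is read here; the function name `centrality` is hypothetical.

```python
from itertools import combinations


def centrality(p):
    """Sum of squared pairwise differences of the components of p."""
    return sum((p[i] - p[j]) ** 2 for i, j in combinations(range(len(p)), 2))


center = (1 / 3, 1 / 3, 1 / 3)   # center of the trinomial simplex
corner = (0.0, 0.0, 1.0)         # a corner of the trinomial simplex
print(centrality(center), centrality(corner))
```

Under this reading the measure is 0 at the center of the simplex and 2 at each corner, so comparing C(p) with the same measure evaluated at the prior mean indicates how far apart the two lie on the center-to-corner scale.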

Risk was usually reduced by one-fourth to one-third when the best
estimator was used instead of the next best estimator and by one-third to
one-half when the best estimator was used instead of the worst estimator.
In some cases, however, the reduction was considerably larger. In
particular, the reduction in risk at the corner probability p=(0,0,1) was
huge; the risk of the posterior mean was as much as 33,000 times larger
than the risk of the posterior mode or maximum likelihood estimate. [The
risks of the maximum likelihood estimate and posterior mode were equal at
p=(0,0,1).] As soon as one moved even slightly away from the corner,
however, the risk difference dropped sharply.

As noted, the posterior mean was the best estimator most of the time. 
In these cases, the posterior mode was usually next best. Other than 
cross-over probabilities, the smallest difference between the posterior 
mode and mean was at the center of the simplex. There, the risk of the 
posterior mean was reduced only 14% to 23% from that of the posterior mode; 
whereas the reduction in risk from that of the maximum likelihood estimate 
ranged from 22% to 42%. 

As the percentage of incomplete data increased from 0 to near 40, the 
risk of the three estimators did not greatly increase and the relationship 
among the estimators changed little. As sample size increased, risk and 
the difference between estimators usually decreased. 

Because numerical evaluation of the exact posterior central moments
is generally infeasible, we also developed approximations for elements
of the posterior mean and covariance matrices. The best of three
approximations considered for the posterior mean was based on a
first-order Taylor-series expansion of the exact posterior mean, which we
accordingly called the Taylor-series approximate posterior mean.
Approximations used for elements of the posterior covariance matrix were
also based on first-order Taylor-series expansions. An important property
of the Taylor-series approximations is that, as the percentage of
incomplete data goes to zero, they go to the exact posterior moments. In
addition, the relationship between the Taylor-series approximate posterior
mean and the posterior mode parallels their complete-data relationship.
That is, the Taylor-series approximate posterior mean for a Dirichlet
density with prior parameters $(\nu_1, \ldots, \nu_{k+1})$ equals the
posterior mode for a Dirichlet density with prior parameters
$(\nu_1 + 1, \ldots, \nu_{k+1} + 1)$.
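
The complete-data relationship that this parallels can be verified directly. The short derivation below uses only standard properties of the Dirichlet distribution; the symbol $\alpha_i$ for the posterior parameters is introduced here for illustration, and the mode formula assumes every $\alpha_i > 1$.

```latex
% Complete data: counts z_i with n = \sum_i z_i, prior Dirichlet(\nu_1,...,\nu_{k+1}).
% The posterior is Dirichlet(\alpha_1,...,\alpha_{k+1}) with \alpha_i = z_i + \nu_i.
\[
\text{posterior mean:}\qquad
E(p_i \mid z) = \frac{\alpha_i}{\sum_{h=1}^{k+1} \alpha_h}
             = \frac{z_i + \nu_i}{n + \sum_{h=1}^{k+1} \nu_h}.
\]
% With prior parameters \nu_i + 1 the posterior parameters are \alpha_i = z_i + \nu_i + 1.
\[
\text{posterior mode:}\qquad
\frac{\alpha_i - 1}{\sum_{h=1}^{k+1} \alpha_h - (k+1)}
  = \frac{z_i + \nu_i}{n + \sum_{h=1}^{k+1} \nu_h}.
\]
```

The two expressions coincide, which is the complete-data identity that the Taylor-series approximate posterior mean preserves for incomplete data.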

To determine the accuracy of the Taylor-series approximate posterior
mean, we first found that the Taylor-series expansion of the exact
posterior mean had accuracy of magnitude $O(n^{-1})$. Because terms in the
expansion were then approximated, the final approximation was not
necessarily accurate to order $O(n^{-1})$. However, we showed that this
approximation asymptotically equals the exact posterior mean. Further, we
gave two conditions which guarantee that the error between the exact
posterior mean and an iterative solution of the Taylor-series approximate
posterior mean is of magnitude $O(n^{-1})$. The two conditions, given by
Lemma 4E.1, concern the region in which the initial iterative estimate is
chosen and a bound on the partial derivatives of the Taylor-series
approximation.

If a neighborhood of the exact posterior mean, of radius $\rho > 0$ in
the norm $\|\cdot\|_\infty$, can be found such that for all probabilities
p in this neighborhood

\[
\max_{1 \le i \le k} \sum_{j=1}^{k}
  \left| \partial g_i(p) / \partial p_j \right| \le \lambda < 1,
\]

for

\[
g_i(p) = \Bigl( z_i + \nu_i + \sum_{D \ni i} z_D \, p_i / p_D \Bigr)
         \Big/ \Bigl( n + \sum_{h=1}^{k+1} \nu_h \Bigr),
\]

and if an initial iterative estimate is chosen within the inner
neighborhood of radius $\rho_0 = \rho - \delta/(1-\lambda)$, where
$\delta$ is a bound on the error in approximating the exact posterior
mean by a first-order Taylor series, then the iterative solution to the
defining equations of the Taylor-series approximate posterior mean will
converge to within $O(n^{-1})$ of the exact posterior mean.
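
The derivative condition can be examined numerically for a given data set. The sketch below is illustrative only: it assumes $p_D$ is the sum of $p_j$ over j in D and that n counts all observations (fully and partially classified), it eliminates the last component through $p_{k+1} = 1 - \sum_{j \le k} p_j$, and it estimates the partial derivatives by central differences. The names `g`, `lambda_bound`, and `iterate` are hypothetical.

```python
import numpy as np


def g(p, z, partial, nu):
    """Right-hand side g_i(p) of the defining equations."""
    p = np.asarray(p, dtype=float)
    num = np.asarray(z, dtype=float) + np.asarray(nu, dtype=float)
    for D, z_D in partial.items():
        idx = list(D)
        num[idx] += z_D * p[idx] / p[idx].sum()
    n = np.sum(z) + sum(partial.values())   # total sample size
    return num / (n + np.sum(nu))


def lambda_bound(p, z, partial, nu, h=1e-6):
    """Largest absolute row sum of the k-by-k Jacobian of g at p."""
    p = np.asarray(p, dtype=float)
    k = p.size - 1

    def reduced(q):
        full = np.append(q, 1.0 - q.sum())   # restore the last component
        return g(full, z, partial, nu)[:k]

    q = p[:k]
    jac = np.empty((k, k))
    for j in range(k):
        step = np.zeros(k)
        step[j] = h
        jac[:, j] = (reduced(q + step) - reduced(q - step)) / (2 * h)
    return np.abs(jac).sum(axis=1).max()


def iterate(p0, z, partial, nu, tol=1e-12, max_iter=10_000):
    """Successive substitution p <- g(p) from the initial estimate p0."""
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        p_new = g(p, z, partial, nu)
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p


# Hypothetical trinomial data with 6 observations known only to be in {0, 1}.
z, partial, nu = [10, 5, 5], {(0, 1): 6}, [1.0, 1.0, 1.0]
root = iterate([1 / 3, 1 / 3, 1 / 3], z, partial, nu)
lam = lambda_bound(root, z, partial, nu)
print(root, lam)
```

A value of `lam` below 1 at the converged root, and at points around it, suggests that the contraction condition holds locally for that data set; it does not by itself establish the radius of the neighborhood required by the lemma.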

If a neighborhood of the exact posterior mean can be found in which
the $\lambda$ bound is satisfied, then for large enough sample sizes, the
second condition can be satisfied by choosing an initial iterative
estimate within the first neighborhood. Even for medium-size samples, the
inner neighborhood is almost as large as the outer neighborhood if the
percentage of incomplete data is moderate. In Appendix 4E, we showed how
to determine, in practice, whether the second condition can be expected to
hold.

As with the convergence condition for the EM algorithm, the conditions
of Lemma 4E.1 need not be met. In fact, there may not even exist any
neighborhood of the exact posterior mean in which the $\lambda$ bound
holds, as we illustrated for an 11-dimensional multinomial problem.
However, Appendix 4E showed that this was not the case for incomplete
trinomial data; there does exist a root in $P_k$ of the Taylor-series
approximate posterior mean that differs from the exact posterior mean by
magnitude $O(n^{-1})$. However, this root need not be unique in $P_k$;
hence, finding it can be difficult. In these cases for trinomial data and
in higher dimensions, because the complete-data relationship between the
posterior mode and posterior mean was paralleled by the relationship
between the posterior mode and the Taylor-series approximate posterior
mean for incomplete data (i.e., the Taylor-series approximate posterior
mean can be written as a posterior mode), we intuitively expect that the
root that is in the guaranteed-convergence region of the exact posterior
mean, or at least the root closest to the exact posterior mean, is given
by whichever root in $P_k$ maximizes the likelihood function

\[
\prod_{i=1}^{k+1} p_i^{z_i + \nu_i - 1} \prod_{D} p_D^{z_D} .
\]

Finally, we gave examples showing that Lemma 4E.1 gives extremely
conservative bounds on the error between the exact posterior mean and the
converged iterative estimate and on the region in which an initial
iterative estimate can be chosen so that successive iterates converge to
within a small error of the exact posterior mean.

Approximations used for elements of the posterior covariance matrix
were based on Taylor-series expansions that were accurate to order
$O(n^{-3/2})$. When the iterative solution for the Taylor-series
approximate posterior mean has accuracy of magnitude $O(n^{-1})$, then the
Taylor-series approximate posterior variance and covariance can be
evaluated noniteratively to have accuracy of magnitude $O(n^{-3/2})$.
These approximations can also be evaluated iteratively. However, assurance
of accuracy of magnitude $O(n^{-3/2})$ then depends on satisfaction of the
two conditions of Lemma 4E.1, where g(p) is replaced by the proper
function.

In the same Monte-Carlo simulation used for the risk study, the 
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Taylor-series approximation for the posterior mean was usually accurate 
to at least four significant figures; that for the posterior variance, 
to at least three significant figures; and that for the posterior covar- 
iance, to at least two significant figures. In practice, the Taylor- 
series approximations will generally be more accurate than numerical 
evaluation of the corresponding exact posterior moments. 

Note that, although the maximum likelihood estimate and posterior
mode asymptotically equal the exact posterior mean (and, hence, the
Taylor-series approximate posterior mean), neither was a good
approximation of the exact posterior mean in the small- and medium-size
samples studied in the simulation. Further, as the percentage of
incomplete data goes to zero, neither goes to the exact posterior mean.
Finally, neither relates to the posterior mode in the same manner that the
complete-data posterior mean relates to the complete-data posterior mode.

Among areas for future work are extensions of the simulation study
to (1) more priors for the distribution of the data and for use in the
Bayesian estimators, (2) investigation of the use of the linear
centrality measure C(p), and (3) higher dimensions of $P_k$.

Between Design 1 and Design 2, nearly all types of probabilities (corner,
noncorner boundary, center, and in-between) were covered in the simulation
studies. We do not expect different results for different values of the
same type of probability. For example, we expect results for the
probability (1,0,0) to be similar to those for the corner probability
(0,0,1). One type of probability not covered was the middle of a side;
e.g., (.00, .51, .49). However, this probability is further from a corner
than were the side probabilities (.00, .15, .85), (.04, .00, .96),
(.00, .07, .93), and (.18, .00, .82) that were included in Design 2.
Therefore, we expect the posterior mean to be the best estimator for a
middle-of-a-side probability for even more values of the prior parameter
$\nu$ used in the Bayesian estimators than it was for these four. The
effect of the size of the "prior-sample size" relative to the size n of
the current data sample was also thought to be adequately addressed. If
the ratio $\sum_j \nu_j / n$ is much smaller than in this study, then the
prior will have little effect on results. If the ratio is much larger,
then the data will have little effect. It might, however, be valuable to
look at more types of priors. For example, why were the results for the
posterior mode when C(p)=.09 in Design 2 [see Figure 7.1 and plot in
Table 7.6] inconsistent with results for the posterior mode for
neighboring values of C(p)? Was this inconsistency because probabilities
near the center of the simplex were more sensitive to use of wrong priors
than probabilities elsewhere in the simplex? [Recall the tightness of the
prior distribution of p given $\nu = (10/3, 10/3, 10/3)$.]

To examine risk as a function of individual values of p, we used the
linear centrality measure C(p). This measure reduces a probability in
essentially two-dimensional space to one dimension. Thus, there are many
probabilities p that map into one value of C(p). It could be that the
values of risk for these many probabilities differ greatly. If so, then
C(p) would not be useful for measuring risk as a function of p; in
particular, for describing the relationship between risk, the value of the
probability being estimated, and the prior used in the Bayesian
estimators. For those probabilities that were studied in the trinomial
simplex, however, C(p) was a very good measure, as evidenced by plots in
Tables 7.4 - 7.6. Risk was a smooth function of C(p) and, for nearly all
values of p that had the same C(p), the risk, for a given estimator and
prior, was approximately the same. A slight exception did occur, however,
for the posterior mode and posterior mean for the probabilities
p=(.23, .42, .35) and p=(.08, .61, .31), both having C(p)=.44, when the
correct prior was used in the Bayesian estimators. [See plots in Table 7.4
and the two corresponding probabilities in the plot in Table 7A.9;
however, note that the risk was the same for these two probabilities when
the perturbed and uniform priors were used. Hence, the unequal results
when the correct prior was used could be due to a poor estimate of risk
for one of these probabilities.] Thus, there might be other problems in
using C(p) in the trinomial simplex that were not encountered in this
study. Would there be any problems in using C(p) in higher dimensions? A
good linear measure of p is even more important in higher dimensions,
where risk could otherwise be much more difficult to relate to p in a
simple manner. Note that, in the trinomial simplex, C(p) was a much better
measure of p for use in analyzing risk than was the maximum, minimum,
component differences, absolute component differences, or
component-squared sums. Either the relationship between risk and these
other measures was less smooth than that with C(p) [recall plots in
Tables 7.4 - 7.6] or, unlike with C(p), usually more than one value of
risk corresponded to one value of these measures.

We are especially interested in how results from the simulation
study carry over to higher dimensions. However, note that several
numerical problems found in this study are likely to be even worse in
higher dimensions. There will almost surely be more multiple roots of the
defining equations for the estimators. If there are more in $P_k$, then
there will be greater difficulty locating the global maximum. More initial
iterative estimates will have to be tried to ensure that all local maxima
are found, and then each of these local maxima will have to be checked to
see if it is the root that maximizes the likelihood. Since $P_k$ becomes
increasingly large as k increases, the search for all local maxima could
be long. Hence, study is needed to examine the roots found by the EM
algorithm. Are there many in $P_k$ or are all but one outside of $P_k$?
For incomplete trinomial data in Appendix 4D, there was one and only one
root in the simplex out of three to five roots for the maximum likelihood
estimate (asymptotic posterior mean), excluding the root (0,0,1), which
was eliminated upon consideration of the data.

Since there are more components to a probability in $P_k$ in higher
dimensions, convergence problems may increase. Finding an initial
iterative estimate that has each component close to the corresponding
component of p is more difficult in higher dimensions; e.g., trying to
approximate 11 components entails more error than trying to approximate
only two components. Under what conditions is
$\nu_i / \sum_{j=1}^{k+1} \nu_j$ from the estimated prior or, in many
cases, $z_i + \sum_{D \ni i} z_D \bigl( z_i / \sum_{j \in D} z_j \bigr)$ a
good initial iterative estimate? Thus, how sensitive to the initial
iterative estimate is convergence of the EM algorithm in higher
dimensions? How does the number of iterations increase with an increase in
the number k of dimensions? Are there more problems in higher dimensions
satisfying the conditions guaranteeing that the EM algorithm will converge
to a local maximum in $P_k$?

Similarly, there may be more problems in approximating the exact
posterior mean in higher dimensions. We showed by example in Chapter 4
that in higher dimensions it will be increasingly difficult to find a
region around the exact posterior mean in which an initial iterative
estimate guarantees convergence of the EM algorithm to within a small
error of the exact posterior mean. However, we also showed by examples in
Appendix 4E that Lemma 4E.1 gives extremely conservative bounds on the
guaranteed-convergence region. Initial iterative estimates were picked
far outside the guaranteed-convergence sphere and the EM algorithm still
converged to the exact posterior mean within the same small error. How
much does the conservatism of the guaranteed-convergence region carry
over to higher dimensions? In particular, when there does not exist a
guaranteed-convergence region, are there any initial iterative estimates
for which the EM algorithm will converge to the exact posterior mean
within a small error? If the Taylor-series approximate posterior mean is a
poor approximation in higher dimensions, can a good approximation be
found? As illustrated in Section 2.2.4, as the number of dimensions
increases, the exact posterior moments become increasingly expensive to
evaluate. Thus, good approximations become increasingly important.
Finally, when multiple roots of the defining equation of the Taylor-series
approximate posterior mean exist in $P_k$, is the root that is closest to
the exact posterior mean, as speculated, the root that maximizes the
likelihood function?

Finally, we assumed in this work (recall Section 1.2) that all incom- 
plete data was incomplete at random. Another area of study, therefore, 
concerns incomplete data where the incompleteness of an observation is not 
random but instead depends on the value that would have been observed. 
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